Multicore processing is here to stay. But it means developers have to look at new ways of coding to take advantage of the performance on offer.
For Krisztián Flautner, ARM's director of R&D, the future of hardware is soft. "People are taking designs that once took lots of gates. Now they want to do them with lots of software," he claimed. "People would like to build fewer chips that they use for more applications. Look at cellphones. Very small changes to the standards have caused companies to spend a lot of money on redesigns.
"The problem is that flexibility and software are seen as being inefficient," Flautner added.
A second issue is that the rapid increases in performance the industry saw in the late 1990s have all but stopped. Chipmakers could once push up clock speeds to take full advantage of denser process technologies, but they cannot turn that knob much further: power consumption rises far faster than performance. The only way to increase performance without raising the clock speed is to distribute the workload across more processor cores. But nobody today has a good answer to how you take a conventional program and parallelise it to run that way. The problem is that most programming languages assume a single thread of control.
There are two ways around the problem: extend one of today's popular languages, or move to a different language that has parallelism built in already. The big problem with the former approach is that, if you were setting out to implement a parallel language, you would not start with C or C++. But the alternative is even harder to sell: there is very little appetite for moving to a different language, although some teams, particularly those working at the boundary between software and hardware, have found there is mileage in throwing out C++ and using something completely different.
Mikko Terho, vice president at Nokia, said the phone maker's 'lablet' at the Massachusetts Institute of Technology had had some success with the Bluespec language for combined hardware and software design, where uncovering parallelism is essential.
"We found that in a number of cores, the code in the new language is so much smaller than the C code that you can have much smaller teams. If you select proper tools, you can do it with so much less manpower," claimed Terho.
At the research level, there is no shortage of new languages. At the IET FPGA Developer's Forum in London late last year, Satnam Singh, a researcher at Microsoft's Cambridge laboratories, reeled off a long list of obscure and often exotic languages that have been developed to deal with the problem of parallelising software. The advantage of a new language, he said, is that you can help people get to grips with multicore development if you stop using languages that encourage you to think sequentially.
"We do know how to program multicore processors for certain tasks. We just can't take Word and do that in parallel. I think that is because we have the wrong assumptions. Because we start with languages that are based on the idea of taking a value from this box and putting it in that box. If we start with languages that don't have these bad assumptions we can get there," Singh explained.
"If you get a book on parallel programming, it will tell you about threads and locks and monitors. But they are the wrong abstractions for writing parallel programs. It's like shaving with a chainsaw. You want to write with composable elements."
However, that does not mean work with conventional languages should be ruled out. Singh described an experiment at Microsoft in which support for automatic locks was added to a conventional programming language. While the locks are in force, he said, "you don't have to worry about other threads". He added: "There is some machinery at some expense that makes it happen. But you don't have to worry about it. For concurrent programming, this is the best kind of weapon we have." Cilk may also point the way forward, because it is a superset of standard C.
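The "automatic lock" idea Singh describes can be illustrated with a minimal C++ sketch. The helper `atomically` below is a hypothetical name, not the API from the Microsoft experiment; it simply runs a body under one global lock, so inside the block the programmer need not reason about other threads at all.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: run a critical section under a single global
// lock. Inside the body, no other thread can interleave.
static std::mutex g_lock;

template <typename F>
void atomically(F&& body) {
    std::lock_guard<std::mutex> guard(g_lock);  // lock taken and released automatically
    body();
}

// Several threads bump a shared counter; because every increment is
// wrapped in atomically(), no update is lost.
long parallel_count(int threads, int iters) {
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                atomically([&] { ++counter; });
        });
    for (auto& w : workers) w.join();
    return counter;  // threads * iters, with no data race
}
```

The "machinery at some expense" Singh mentions is visible here: every increment pays for a lock acquisition, but the programmer never has to think about which lock protects which data.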
For its work on parallelisation, Codeplay decided to take the approach of extending C++, working on the basis that, if you can deal with the 'side effects' of parallelism efficiently, you can get significant speedups. The company's Sieve C++ language was designed for the gaming industry – an area where programmers are keen to eke out whatever performance they can from today's multicore systems. Not only are the host processors now multiprocessors, the graphics adapters themselves are massively parallel processors. So much so that graphics chipmaker Nvidia is offering some versions of its chips as accelerators for scientific computing. Codeplay is also doing work on IBM's Cell processor, which is used in the PlayStation 3.
The biggest obstacle to scaling in the Codeplay system, said the company's Alastair Donaldson, is cache behaviour. Operations such as matrix multiplication and noise reduction in images saw near-linear speedups in moving from one core to eight. But the fast Fourier transform (FFT) saw practically no speedup at all. "That was probably because of poor use of cache lines," he explained.
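Matrix multiplication scales well partly because it can be split so that each thread owns a contiguous band of output rows and never touches another thread's cache lines. A minimal sketch of that row-banded split, under the assumption of a simple nested-vector matrix type (this is an illustration, not Codeplay's code):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Row-banded parallel matrix multiply: thread t computes a contiguous
// block of output rows, so writes from different threads never share
// a cache line -- the access pattern behind near-linear speedups.
Matrix multiply(const Matrix& a, const Matrix& b, unsigned threads) {
    const std::size_t n = a.size(), k = b.size(), m = b[0].size();
    Matrix c(n, std::vector<double>(m, 0.0));
    const std::size_t band = (n + threads - 1) / threads;  // rows per thread
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t lo = t * band;
        const std::size_t hi = std::min(n, lo + band);
        pool.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                for (std::size_t j = 0; j < m; ++j)
                    for (std::size_t x = 0; x < k; ++x)
                        c[i][j] += a[i][x] * b[x][j];
        });
    }
    for (auto& th : pool) th.join();
    return c;
}
```

An FFT, by contrast, has butterfly access patterns that stride across the whole array, which is exactly the "poor use of cache lines" Donaldson describes.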
Flautner said there is a potential issue with environments that assume the processing cores are homogeneous.
"We need to consider how much to move into domain-specific architectures rather than having a generic C engine," said Flautner. "The vision of the future as some people express it in the post-frequency scaling environment is the move from one to two to four to eight to thirty-two all the way to a billion cores. I wondered what these guys are smoking. Will it be another trend where you just turn the knob? And live on it for 20 years?" Flautner asked.
Flautner noted that in most embedded multiprocessors, "each generation you tune the memory and processors. You end up with ten or so cores, each with its own specialisation. Is that going to change?"
In the case of a Conexant home-gateway processor that he used as an example, all of the processor cores were based on the ARM architecture. But they were different members of the family. Each core in the SoC may have a different focus and use different coprocessors or extensions.
"In the mobile phone, the SoCs look quite lumpy but they have been around for some time. These people care about efficiency and they have come up with a way of evolving the architectures that is quite unlike the server domain," Flautner explained. "It has been like that for a while and it's not going to change."
The problem for developers is that 'lumpy' multiprocessors are tougher to deal with automatically. As the parallelising languages tend to prefer homogeneous architectures, embedded engineers looking to maximise their use of multiprocessor SoCs are likely to have to continue using a mixture of environments. The host application may split neatly across a four- or an eight-core symmetric multiprocessor and be programmed using an extended form of C++. But other parts of the system will be heavily customised. These are where developers may need to explore new languages as traditional approaches run out of steam.
Running on the off chance
Codeplay's main addition to C++ is the 'sieve' block. Inside the block, any writes back to memory do not happen immediately but are put into a queue. The data locations do not get updated until the program leaves the block – a runtime engine then works through the writes. Dealing with writes can dramatically slow down automatically parallelised software because you have to keep checking for conflicts between software running on different processors.
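The deferred-write semantics can be sketched in plain C++. The `SieveBlock` type below is a hypothetical illustration of the idea, not Codeplay's implementation or syntax: writes made inside the block are queued, and the target locations change only when the block is left.

```cpp
#include <functional>
#include <vector>

// Sketch of a 'sieve' block: writes are queued rather than performed,
// so code inside the block observes no side effects from its own writes.
struct SieveBlock {
    std::vector<std::function<void()>> pending;  // queued memory writes

    // Record a write; the value is captured now, applied later.
    void write(int& location, int value) {
        pending.push_back([&location, value] { location = value; });
    }

    // "Leaving the block": the runtime works through the queue in order.
    void commit() {
        for (auto& w : pending) w();
        pending.clear();
    }
};
```

Because no location changes until `commit()`, code inside the block can be farmed out across cores without checking for write conflicts as it runs; the queue is resolved once, at the end.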
What Sieve C++ allows you to do, explained Alastair Donaldson at the Fourth Workshop on Compilers and Architectures in Cambridge, is run code speculatively. There is a chance you will waste some processor cycles but it allows you to spread a loop across many processors.
Donaldson demonstrated the technique with a vector division operation: code that runs thousands of division operations on an array of more than a thousand elements. The trouble is, you do not know at compile time how many elements the array will hold. If you expect to work with around 4,000 elements, you might split the algorithm into four parts, each running on a different processor core and taking 1,000 elements apiece. If the actual calculation only has 1,800 elements, the work performed by two of the cores will be entirely redundant. However, the language supports a special class of variable that lets you hint to the runtime engine how many elements there are likely to be.
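That speculative split can be sketched as follows. This is an illustration in ordinary C++ threads, not Sieve C++ syntax; the `expected` parameter plays the role of the hint variable, and chunks that fall past the real end of the array simply iterate zero times, which is the wasted work Donaldson describes.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Speculative loop split: always divide the work into `cores` chunks
// sized from the *hinted* element count. Chunks beyond the real array
// end do nothing useful -- a few wasted cycles in exchange for an
// even split decided before the true size is known.
void vector_divide(std::vector<double>& v, double divisor,
                   unsigned cores, std::size_t expected) {
    // Size chunks from the hint, but never smaller than the real data
    // needs, so every element is still covered.
    const std::size_t chunk =
        (std::max(expected, v.size()) + cores - 1) / cores;
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c) {
        const std::size_t lo = c * chunk;
        const std::size_t hi = std::min(v.size(), lo + chunk);
        pool.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                v[i] /= divisor;  // past-the-end chunks loop zero times
        });
    }
    for (auto& th : pool) th.join();
}
```

With a hint of 4,000 and a real size of 1,800 spread over four cores, two of the four threads here would find `lo >= hi` and return immediately, matching the article's example.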
With Sieve C++, said Donaldson, you do not have to encode into the software how many cores the target system will have. The runtime engine knows that, using that information to distribute functions. It means that code can be written for a particular platform, such as the x86, and moved from an eight-core to a sixteen-core to a four-core machine without changes.
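The runtime's side of that portability claim rests on discovering the core count at run time rather than at compile time. A minimal sketch of that discovery step in standard C++ (an assumption about how such a runtime might size its thread pool, not Codeplay's code):

```cpp
#include <cstddef>
#include <thread>

// Ask the system how many hardware threads are available and size the
// worker pool to match, so the same binary adapts from four cores to
// sixteen without recompilation.
std::size_t worker_count() {
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    return hw ? hw : 1;  // fall back to a single worker
}
```

A runtime engine would then hand each of the `worker_count()` threads its share of the queued work, which is why the programmer never hard-codes an eight- or sixteen-way split.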