Concern over attributes like energy consumption is set to radically change the way computer servers are designed – but will we at last be able to move away from the conventions that have dominated design for the last 20 years?
Twenty-five years ago last June, IBM launched a minicomputer at a point when that class of system began to succumb to the much cheaper technologies developed for the personal-computer industry. The AS/400 employed techniques that even by 2013 standards are unusual – these helped the machine to survive in an environment that has become highly homogenised. It is rare to find a computer that does not combine processors based on Intel's decades-old x86 architecture with ideas from the Unix family of operating systems.
The AS/400 was instead designed to accommodate any form of processor. Software written for the machine never gets compiled to the native machine code of its host processor, which now happens to be based on IBM's Power architecture. And its operating system hid details from the software of what was in memory and what was stored on disk, in stark contrast to the way that Unix and similar operating systems, such as Microsoft Windows, work.
The notion of a single server architecture is, however, beginning to fade as data-centre users focus on squeezing more performance-per-watt out of their systems. Although they employ a Linux core, the servers operated by companies such as Google have been heavily customised. Many server users are also questioning whether reliance on one type of processor (no matter how accomplished it is) is altogether wise – putting a much greater focus on how data moves through the system.
The consolidation on Linux in Internet applications, together with other portable open-source software such as the Apache web server, the MySQL database engine, and easily learned scripting languages such as PHP, Perl, and Python – a combination known as the LAMP stack – is helping to open up options for new and innovative hardware designs. Unlike Windows, which dominated business computing for some 20 years, LAMP software is not tied to Intel's x86 instruction-set architecture (ISA).
"We don't need x86 ISA compatibility any more thanks to the LAMP stack," is the view of Hunter Scales, processor architect at chipmaker AMCC, speaking at the recent Design Automation Conference (DAC) in Austin, Texas, in June 2013.
Processor maker ARM's reputation for low-power processor designs implemented for the mobile-phone market has made it an attractive option for servers, where power consumption is becoming the biggest problem to overcome. As Bruno Michel, leader of the thermal packaging group at IBM Research, points out, "In the future energy cost will be higher than hardware cost".
Mike Filippo, processor architect at ARM, previously worked at AMD and Intel. "One of the processors I did before was a supercomputer chip that consumed 250W," he recalls. "I joined ARM and realised: 'Wow I'm two orders of magnitude off here'." But even processor architects do not think that ISA matters for long-term energy reductions – the differences are more about tradition and habit than anything. The savings will have to come from other places.
Dave Shippi, processor architect at Altera, which is integrating processors with customisable logic in its field-programmable gate arrays (FPGAs), concurs: "I've worked on PowerPC, ARM, and x86, and I've designed microprocessors for all these. The ISA just does not matter. The claim is that ARM is more efficient, but that's just not true. We designed x86 microprocessors at AMD and went head-to-head with ARM. [ARM] says that you've got to decode those complex instructions on x86 – but that's not the full picture. There is a load-store RISC architecture inside all these processors. They are all efficient RISC architectures."
ARM became a candidate for server designs when Dell found, in 2007, that it needed to re-examine how it put machines together. Robert Hormuth, an architect from the office of the CTO at Dell, says: "The inflection we saw was that workloads did not match what we were putting under the hood."
The workloads that servers process can be very different from each other, Hormuth explains: "Many workloads are now not compute-intensive, they are data-intensive. One example is search. It's not about computing the next digit of Pi, but about swapping data and looking through hashes in memory. There are still compute-intensive applications in sectors such as oil and gas. And then there is simple data movement, such as a web server. Is it compute intensive? Not at all. It's more about getting something off a disk. It's no longer one-size-fits-all."
Hormuth adds: "So if I want a performance improvement, where do I go? The key difference between these applications is how they handle memory and disk storage. CPUs matter, but the CPU is not the most critical element on anyone's list any more."
So-called 'scale-out applications' – big-data software used to handle search requests at Google and social-network queries at Facebook among others – are doing the most to demonstrate the limitations of current processor-centric designs. These applications have a massive need for memory, and are usually designed to share the burden across many server blades. The program code is relatively small, but the amount of data they can access is extremely large. Not only that, once a piece of data has been analysed, it is unlikely to be needed again for some time. This is different to many conventional desktop and server applications, which use relatively small but fast caches to store data temporarily so that it can be reused more easily if required.
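The contrast between the two access patterns can be sketched with a toy model. The Python below is only an illustration under stated assumptions: the least-recently-used cache, its 64-entry capacity, the 32-item working set and the 10,000 accesses are all hypothetical values chosen for the example, not figures from any of the studies quoted here.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses served by a least-recently-used cache of `capacity` items."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used item
    return hits / total if total else 0.0


CAPACITY = 64   # hypothetical cache size, in items
N = 10_000      # hypothetical number of accesses

# Conventional pattern: a small working set that is revisited again and again.
reuse_pattern = [i % 32 for i in range(N)]
# Scale-out pattern (simplified): every item is touched once and not needed again.
scan_pattern = list(range(N))

reuse_rate = hit_rate(reuse_pattern, CAPACITY)
scan_rate = hit_rate(scan_pattern, CAPACITY)
print(f"small working set: {reuse_rate:.2%} of accesses hit the cache")
print(f"touch-once data:   {scan_rate:.2%} of accesses hit the cache")
```

In this simplified model the cache serves almost every access when the working set fits inside it, and none at all when each item is touched only once. Real workloads sit somewhere between the two extremes.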
At École Polytechnique Fédérale de Lausanne in Switzerland, a team simulated a number of server designs with typical scale-out code and found that, far from helping, data caches get in the way and cause far more bytes to be copied from memory to cache than are ever used. When first analysing server performance while working at Microsoft, Christos Kozyrakis, associate professor of Electrical Engineering & Computer Science at Stanford University, thought the memory usage of scale-out applications was an experimental error. "The CPU utilisation was high, but memory bandwidth was extremely low," he reports. "At first, I thought it was a mistake."
The reality is that scale-out servers do not need the bandwidth. Hormuth explains what often happens using the example of 'linked-list chasing'. A commonly-used data structure in computing, the linked list is an object that contains a small piece of data and a pointer to the next item on the list. To find what it wants, the application needs to follow this chain of pointers.
To pull that data into a register to check it, the latest computers will fetch not just the 32-bit or 64-bit 'word' that contains the pointer, but an entire cache line of eight or more words. Caches do this because, for more compute-intensive applications, there is a strong likelihood of nearby data being used; but for the linked lists and other similar data structures used in search and social-networking applications, "I throw away seven-eighths of my bandwidth to get that one word," according to Hormuth.
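Hormuth's seven-eighths figure can be reproduced with a few lines of arithmetic. The sketch below is a simplified model, not a measurement: it assumes 64-bit words, a 64-byte cache line (eight words), a list of 1,000 nodes, and that every node sits on its own cache line so each hop misses the cache. The function names are invented for the illustration.

```python
import random

WORD_BYTES = 8    # one 64-bit word holding the pointer
LINE_BYTES = 64   # assumed cache line: eight 64-bit words


def build_scattered_list(n_nodes, seed=0):
    """Place each node on its own cache line, in shuffled order.

    Returns the head address and a dict mapping each node's address
    to the address of the next node (None at the end of the list).
    """
    rng = random.Random(seed)
    addresses = [i * LINE_BYTES for i in range(n_nodes)]
    rng.shuffle(addresses)
    next_ptr = {addresses[i]: addresses[i + 1] for i in range(n_nodes - 1)}
    next_ptr[addresses[-1]] = None
    return addresses[0], next_ptr


def chase(head, next_ptr):
    """Follow the chain of pointers, counting bytes used and bytes fetched."""
    used = 0
    fetched = 0
    addr = head
    while addr is not None:
        fetched += LINE_BYTES   # assumed cache miss: the whole line is transferred
        used += WORD_BYTES      # only the pointer word is actually read
        addr = next_ptr[addr]
    return used, fetched


head, next_ptr = build_scattered_list(1000)
used, fetched = chase(head, next_ptr)
wasted_fraction = 1 - used / fetched
print(f"used {used} of {fetched} bytes fetched; wasted {wasted_fraction:.1%}")
```

With these assumptions the program uses 8,000 of the 64,000 bytes it pulls from memory, which is the seven-eighths waste Hormuth describes. A list whose nodes happened to be packed next to each other would waste far less.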
Turning off the cache and simply loading one word at a time direct from memory would be as effective and massively reduce the needed bandwidth – assuming that other parts of the application cannot take advantage of the cache. "It turns out that there is a very good technology available now, which is LPDDR2," says Christos Kozyrakis at Stanford University, pointing to the memory now used in current mobile phones and tablet PCs.
Simulating an LPDDR2-based server against a conventional design, Kozyrakis found no appreciable difference in performance – "You didn't need the bandwidth to begin with" – but the energy consumption, as found with mobile phones, should be much lower. The problem now is that memory makers do not make high-density versions of these memories, making them bulky alternatives to 'server-grade memory'.
Data trafficking changes
ARM's Filippo warns that making big changes to cache architecture on the basis of certain applications is potentially dangerous: "The hard part is characterising your workloads effectively. If you decide on a certain cache structure, you'd better be right," he says.
Scales says some mobile processors offer some control over cache behaviour, but that this focuses on relatively simple changes, such as how the cache decides to discard data it thinks it no longer needs. Changing the way it fetches data dynamically is in fact difficult to do.
One option may be to move the more difficult, data-hungry workloads out to other hardware. "We can either bring the data to the computer, or we bring the computer closer to the data," says Altera's Dave Shippi. "People have tried to do that for a number of years. People tried to build DRAM into CPUs, or build little processors in DRAM. This is where the FPGA can get interesting," Shippi claims, pointing to the way in which custom processors built out of programmable logic can bypass caches and other support mechanisms that standard processors use. The FPGA can also implement data-fetching strategies that are tuned for specific algorithms.
"We are evolving towards a heterogeneous environment," says Shippi. "All those workloads will require different processing models. Some applications map nicely onto tiny digital signal processors. There are some that map onto a CPU. And there are others where you will roll your own secret sauce with an accelerator. The heterogeneous environment is going to give us the flexibility to attack different workloads."
The problem meanwhile for the FPGA is that hardware design involves different languages and thought processes to software programming. As a result, only a few specialised applications have used FPGAs for computing, and often in areas such as video processing where the performance advantage over standard processors is immense. "Luckily, we have technology now called OpenCL that lets us efficiently represent parallel algorithms in software," Shippi avers.
Beyond the CPU
As the CPU fades in significance, other parts of the machine are coming under scrutiny. The next steps will focus on the way in which data moves between memory and hard drives, which are increasingly being replaced by flash memory.
Because 64-bit systems can address massive amounts of data directly, the software architecture of these machines is likely to mirror concepts from IBM's aforementioned AS/400 of 1988, which hid the disk drives from the programmer so that the operating system could manage data delivery more efficiently.
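On Unix-like systems and Windows, the nearest everyday analogue to this idea is the memory-mapped file, in which the program indexes data as if it were in memory and the operating system decides when to bring pages in from storage. The Python sketch below is only an analogy and says nothing about how the AS/400 itself is implemented. The function name and the temporary file are invented for the example.

```python
import mmap
import os
import tempfile


def sum_bytes_via_mapping(path):
    """Add up every byte in a file by indexing it like an in-memory array.

    The program never issues an explicit read; the operating system
    pages the data in from storage as each location is touched.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return sum(view[i] for i in range(len(view)))


# Create a small throwaway file holding the byte values 0..255.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
    path = tmp.name

try:
    total = sum_bytes_via_mapping(path)
finally:
    os.remove(path)

print(total)
```

The point of the design is that the code inside the function looks the same whether the data is already in DRAM or still on a disk or flash device, which leaves the scheduling of data delivery to the system.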
"We're looking at enterprise computing from the ground up. We're looking at everything in the system: DRAM; SSDs; networking; everything," reports Filippo. "It's an exciting time to be in cloud and enterprise computing in general."
Anil Sabbavarapu, hardware architect at Intel, is in agreement: "You have to consider the complete system. It's not the CPU alone, it's the integrated system".