Re-building the server - microprocessors to memory moves
It’s time to change the way we think about PC server design, the innovators say
Concern over attributes like energy consumption is set to radically change the way computer servers are designed – but will we at last be able to move away from the conventions that have dominated design for the last 20 years?
Twenty-five years ago last June, IBM launched a minicomputer at a point when that class of system began to succumb to the much cheaper technologies developed for the personal-computer industry. The AS/400 employed techniques that even by 2013 standards are unusual – these helped the machine to survive in an environment that has become highly homogenised. It is rare to find anything that goes into a computer that does not use processors based on Intel's decades-old x86 architecture together with ideas from the Unix family of operating systems.
The AS/400 was instead designed to accommodate any form of processor. Software written for the machine never gets compiled to the native machine code of its host processor, which now happens to be based on IBM's Power architecture. And its operating system hid details from the software of what was in memory and what was stored on disk, in stark contrast to the way that Unix and similar operating systems, such as Microsoft Windows, work.
The notion of a single server architecture is, however, beginning to fade as data-centre users focus on squeezing more performance-per-watt out of their systems. Although they employ a Linux core, the servers operated by companies such as Google have been heavily customised. Also many server users are questioning whether reliance on one type of processor (no matter how accomplished it is) is altogether wise – putting a much greater focus on how data moves through the system.
The consolidation on Linux in Internet applications, together with other portable open-source software such as the Apache web server, the MySQL database engine, and easily learned scripting languages such as PHP, Perl, and Python – a combination known as the LAMP stack – is helping to open up options for new and innovative hardware designs. Unlike Windows, which dominated business computing for some 20 years, LAMP software is not tied to Intel's x86 instruction-set architecture (ISA).
"We don't need x86 ISA compatibility any more thanks to the LAMP stack," is the view of Hunter Scales, processor architect at chipmaker AMCC, speaking at the recent Design Automation Conference (DAC) in Austin, Texas, in June 2013.
Processor maker ARM's reputation for low-power processor designs implemented for the mobile-phone market has made it an attractive option for servers, where power consumption is becoming the biggest problem to overcome. As Bruno Michel, leader of the thermal packaging group at IBM Research, points out, "In the future energy cost will be higher than hardware cost".
Mike Filippo, processor architect at ARM, previously worked at AMD and Intel. "One of the processors I did before was a supercomputer chip that consumed 250W," he recalls. "I joined ARM and realised: 'Wow I'm two orders of magnitude off here'." But even processor architects do not think that ISA matters for long-term energy reductions – the differences are more about tradition and habit than anything. The savings will have to come from other places.
Dave Shippi, processor architect at Altera, which is integrating processors with customisable logic in its field-programmable gate arrays (FPGAs), concurs: "I've worked on PowerPC, ARM, and x86, and I've designed microprocessors for all these. The ISA just does not matter. The claim is that ARM is more efficient, but that's just not true. We designed x86 microprocessors at AMD and went head-to-head with ARM. [ARM] says that you've got to decode those complex instructions on x86 – but that's not the full picture. There is a load-store RISC architecture inside all these processors. They are all efficient RISC architectures."
ARM became a potential option for server designs when Dell found in 2007 that it needed to re-examine how it put machines together. Robert Hormuth, an architect from the office of the CTO at Dell, says: "The inflection we saw was that workloads did not match what we were putting under the hood."
The workloads that servers process can be very different from each other, Hormuth explains: "Many workloads are now not compute-intensive, they are data-intensive. One example is search. It's not about computing the next digit of Pi, but about swapping data and looking through hashes in memory. There are still compute-intensive applications in sectors such as oil and gas. And then there is simple data movement, such as a web server. Is it compute intensive? Not at all. It's more about getting something off a disk. It's no longer one-size-fits-all."
Hormuth adds: "So if I want a performance improvement where do I go? The key difference between these applications is how they handle memory and disk storage. CPUs matter, but the CPU is not the most critical element on anyone's list any more."
So-called 'scale-out applications' – big-data software used to handle search requests at Google and social-network queries at Facebook among others – are doing the most to demonstrate the limitations of current processor-centric designs. These applications have a massive need for memory, and are usually designed to share the burden across many server blades. The program code is relatively small, but the amount of data they can access is extremely large. Not only that, once a piece of data has been analysed, it is unlikely to be needed again for some time. This is different to many conventional desktop and server applications, which use relatively small but fast caches to store data temporarily so that it can be reused more easily if required.
At École Polytechnique Fédérale de Lausanne in Switzerland, a team simulated a number of server designs with typical scale-out code and found that, far from helping, data caches get in the way and cause far more bytes to be copied from memory to cache than are ever used. When first analysing server performance while working at Microsoft, Christos Kozyrakis, associate professor of Electrical Engineering & Computer Science at Stanford University, thought the memory usage of scale-out applications was an experimental error. "The CPU utilisation was high, but memory bandwidth was extremely low," he reports. "At first, I thought it was a mistake."
The reality is that scale-out servers do not need the bandwidth. Hormuth explains what often happens using the example of 'linked-list chasing'. A commonly-used data structure in computing, the linked list is an object that contains a small piece of data and a pointer to the next item on the list. To find what it wants, the application needs to follow this chain of pointers.
To pull that data into a register to check it, the latest computers will fetch not just the 32-bit or 64-bit 'word' that contains the pointer, but an entire cache line of eight or more words. Caches do this because, for more compute-intensive applications, there is a strong likelihood of nearby data being used; but for the linked lists and other similar data structures used in search and social-networking applications, "I throw away seven-eighths of my bandwidth to get that one word," according to Hormuth.
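Hormuth's seven-eighths figure can be reproduced with a toy model. The sketch below is illustrative only: it assumes a 64-byte cache line (eight 64-bit words), a cache that remembers just the most recently fetched line, and made-up node addresses. The function name `chase_traffic` is hypothetical and does not come from any tool mentioned in the article.

```python
WORD_BYTES = 8    # one 64-bit pointer (assumption)
LINE_BYTES = 64   # assumed cache line of eight 64-bit words


def chase_traffic(node_addresses, word_bytes=WORD_BYTES, line_bytes=LINE_BYTES):
    """Return (bytes_used, bytes_fetched) when one pointer is read per node.

    Toy model: the cache keeps only the last line fetched, so a new line
    is pulled from memory whenever the next node sits in a different line.
    """
    lines_fetched = 0
    cached_line = None
    for address in node_addresses:
        line = address // line_bytes
        if line != cached_line:
            lines_fetched += 1
            cached_line = line
    bytes_used = len(node_addresses) * word_bytes
    bytes_fetched = lines_fetched * line_bytes
    return bytes_used, bytes_fetched


# Linked-list nodes scattered across memory (hypothetical addresses):
# every hop lands in a different cache line.
scattered = [i * 4096 for i in range(1000)]
used, fetched = chase_traffic(scattered)
wasted_fraction = 1 - used / fetched

# The same number of words laid out back to back, as in an array scan:
# every byte of every fetched line is used.
contiguous = [i * WORD_BYTES for i in range(1000)]
used_c, fetched_c = chase_traffic(contiguous)
wasted_fraction_contiguous = 1 - used_c / fetched_c
```

Under these assumptions the scattered list wastes seven-eighths of the bytes it fetches, while the contiguous layout wastes none, which is the contrast between pointer-chasing and compute-intensive code that the paragraph describes.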
Turning off the cache and simply loading one word at a time direct from memory would be as effective and massively reduce the needed bandwidth – assuming that other parts of the application cannot take advantage of the cache. "It turns out that there is a very good technology available now, which is LPDDR2," says Christos Kozyrakis at Stanford University, pointing to the memory now used in current mobile phones and tablet PCs.
Simulating an LPDDR2-based server against a conventional design, Kozyrakis found no appreciable difference in performance – "You didn't need the bandwidth to begin with" – but the energy, as found with mobile phones, should be much lower. The problem now is that memory makers do not make high-density versions of these memories, making them bulky alternatives to 'server-grade memory'.
Data trafficking changes
ARM's Filippo warns that making big changes to cache architecture on the basis of certain applications is potentially dangerous: "The hard part is characterising your workloads effectively. If you decide on a certain cache structure, you'd better be right," he says.
Scales says some mobile processors offer some control over cache behaviour, but that this focuses on relatively simple changes, such as how the cache decides to discard data it thinks it no longer needs. Changing the way it fetches data dynamically is in fact difficult to do.
One option may be to move the more difficult, data-hungry workloads out to other hardware. "We can either bring the data to the computer, or we bring the computer closer to the data," says Altera's Dave Shippi. "People have tried to do that for a number of years. People tried to build DRAM into CPUs, or build little processors in DRAM. This is where the FPGA can get interesting," Shippi claims, pointing to the way in which custom processors built out of programmable logic can bypass caches and other support mechanisms that standard processors use. The FPGA can also implement data-fetching strategies that are tuned for specific algorithms.
"We are evolving towards a heterogeneous environment," says Shippi. "All those workloads will require different processing models. Some applications map nicely onto tiny digital signal processors. There are some that map onto a CPU. And there are others where you will roll your own secret sauce with an accelerator. The heterogeneous environment is going to give us the flexibility to attack different workloads."
The problem meanwhile for the FPGA is that hardware design involves different languages and thought processes to software programming. As a result, only a few specialised applications have used FPGAs for computing, and often in areas such as video processing where the performance advantage over standard processors is immense. "Luckily, we have technology now called OpenCL that lets us efficiently represent parallel algorithms in software," Shippi avers.
Beyond the CPU
As the CPU fades in significance, other parts of the machine are coming under scrutiny. The next steps will focus on the way in which data moves between memory and hard drives, which are increasingly being replaced by flash memory.
Because 64-bit systems can address massive amounts of data directly, the software architecture of these machines is likely to mirror concepts from IBM's aforementioned AS/400 of 1988, which hid the disk drives from the programmer so that the operating system could manage data delivery more efficiently.
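A loose present-day analogy for hiding the disk from the programmer is the memory-mapped file, in which a program reads and writes what looks like ordinary memory and leaves the operating system to decide when data moves to and from storage. The sketch below uses Python's standard `mmap` module to show the idea; it is not a description of how the AS/400 works internally, and the function names `bump_counter` and `demo` are hypothetical.

```python
import mmap
import os
import tempfile


def bump_counter(path: str) -> int:
    """Increment a 64-bit counter held in a file, using only memory operations."""
    with open(path, "r+b") as f:
        # Map the whole file; from here on it is addressed like a byte array.
        with mmap.mmap(f.fileno(), 0) as view:
            value = int.from_bytes(view[:8], "little") + 1
            view[:8] = value.to_bytes(8, "little")
            view.flush()  # ask the OS to write the change back to storage
            return value


def demo() -> int:
    """Create a scratch file holding a zeroed counter and bump it twice."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(bytes(8))  # eight zero bytes: counter starts at 0
        bump_counter(path)
        return bump_counter(path)
    finally:
        os.remove(path)
```

The second call to `bump_counter` sees the value left by the first even though the code never issues an explicit file read or write for the counter itself; the operating system handles the movement between memory and storage.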
"We're looking at enterprise computing from the ground up. We're looking at everything in the system: DRAM; SSDs; networking; everything," reports Filippo. "It's an exciting time to be in cloud and enterprise computing in general."
Anil Sabbavarapu, hardware architect at Intel, is in agreement: "You have to consider the complete system. It's not the CPU alone, it's the integrated system".
Harnessing natural forces: Reassigning the server's body parts for the next generation
The future will bring changes to the way electronic computer systems are put together – particularly in replacing wires with non-contact technologies.
IBM's centenary in 2011 saw its engineers think of architectures that are radically different to what we see today. And the first steps towards them are already being taken by researchers at institutions such as the Technical University of Dresden, where the aim is to pack the performance of a standard supercomputer into a 10cm cube.
Bruno Michel, leader of the thermal packaging group at IBM Research, explained at the DATE 2013 conference in Grenoble last March: "We were asked to look 100 years into the future. We said 'no way', but that 30 to 50 years was possible... In the past, we could rely on transistors to become smaller. The theoretical physicist Richard Feynman said, 50 years ago, that there is 'plenty of room at the bottom' – but there is not a lot of room at the bottom left now. But there is a lot of room between components."
Chip stack knack
Michel added: "If you compare a biological brain with a computer, the brain is 10,000 times more efficient. The biological brain shows how to do volumetric density scaling. Let's take the brain and copy the packaging aspects of the brain – but we don't want to copy the operation. That's too complicated."
The chipmaking industry has started work on improving density by stacking chips on top of each other. The Memory Cube concept being developed by Samsung and other memory makers is one example of a technique that uses vertical connections drilled through each chip to allow them to be plugged together directly.
Work by Intel and others has shown that bringing processors and memory closer together, which cuts parasitic capacitance and inductance, can slash the power needed to transfer data between them. At 100GB/s, it takes roughly 25 picojoules (pJ) to transfer each bit of data when chips are mounted conventionally on a PCB. Stacking the chips can cut this energy to just 1pJ/bit.
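To put those per-bit figures into watts, the arithmetic can be sketched as follows. The calculation assumes that 100GB/s means 100 × 10⁹ bytes per second and that the link runs at that rate continuously; the function name `link_power_watts` is hypothetical.

```python
def link_power_watts(bandwidth_gbytes_per_s: float, energy_pj_per_bit: float) -> float:
    """Power drawn by a data link: bits per second times joules per bit."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    joules_per_bit = energy_pj_per_bit * 1e-12
    return bits_per_second * joules_per_bit


pcb_watts = link_power_watts(100, 25)     # chips mounted on a PCB: about 20 W
stacked_watts = link_power_watts(100, 1)  # stacked chips: about 0.8 W
```

On these assumptions, moving data at full rate costs about 20W over a conventional board and under 1W in a stack, a 25-fold reduction for the link alone.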
"But we also have the issue of heat dissipation," Michel cautioned. Processors and other devices that run at high temperature normally lose heat through the PCB and heatsinks attached to the top of the package. If other heat-producing chips are in the way, the devices run the risk of overheating and failing. One way around this is to add more connections into the chip stack that are used to pass liquid coolant – a return to the water-cooling techniques of old mainframes but on a microscopic scale.
"There is a third problem – power," says Michel. "In a processor, three-quarters of the wires into the device are used for electrical power and ground connections."
IBM's proposal is to stop using wires to deliver energy to the chip. "The solution? We call it 'electrical blood'," says Michel. "Our brain doesn't use copper wires to supply energy, it uses blood. Doing this we can free up surface area for communication."
Rather than deliver electrical energy directly, the 'electrical blood' proposed by IBM is an electrolyte solution that delivers chemicals to miniature fuel cells mounted on each chip. Chemical reactions generate the electrical energy and the waste products are carried away through the computer's 'bloodstream', which will also help direct excess heat away from the chips. The hot fluid could be used to provide heat to other systems, and recover some of the energy cost of running the computer, Michel suggested. He claims the combination of techniques could shrink a computer with the power of SuperMUC – a 3-petaflops supercomputer based in Munich that covers an area of 500m² – into a 10-litre cube.
There is a limit to how big each chip stack can get even with 3D interconnect, cooling and electrical blood. One problem is getting data from one stack to another. Researchers at the Technical University of Dresden believe one answer is to move from boards plugged into backplanes – which tend to create a wiring bottleneck – to wireless communications.
Free-space optical communication is one possibility but steering the light is tricky. Instead, millimetre-wave communication, similar to techniques proposed for 5G cellular networks, offers a viable option. Because the distances are very short, comparatively little power is needed to deliver data at 10Gb/s or more. By dropping much of the point-to-point wiring completely, the cubic computer could transcend the biological brain.
Architectures to come: the forward-looking rack-mounted server
Voltages used to supply power to each unit will increase to reduce losses. Local power converters in each processor-memory complex should be able to reduce the voltage to the sub-1V levels needed by logic and memory devices.
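The reasoning behind higher supply voltages is that, for a fixed power draw, current falls as voltage rises, and resistive loss in the wiring scales with the square of the current. A minimal sketch follows, using hypothetical figures that are not taken from the article: a 1kW load, a supply path with 0.01 ohm of resistance, and 12V and 48V distribution rails.

```python
def distribution_loss_watts(load_watts: float, supply_volts: float, path_ohms: float) -> float:
    """Resistive loss in the supply path: I^2 * R, with I = P / V."""
    current_amps = load_watts / supply_volts
    return current_amps ** 2 * path_ohms


LOAD_WATTS = 1000.0  # hypothetical power drawn by one unit
PATH_OHMS = 0.01     # hypothetical resistance of cables and connectors

loss_at_12v = distribution_loss_watts(LOAD_WATTS, 12.0, PATH_OHMS)
loss_at_48v = distribution_loss_watts(LOAD_WATTS, 48.0, PATH_OHMS)
loss_ratio = loss_at_12v / loss_at_48v  # quadrupling the voltage cuts loss 16-fold
```

With these numbers the loss drops from roughly 69W to a little over 4W, which is why the final step down to sub-1V levels is best done as close to the logic as possible.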
Processors and memory
Already positioned close to each other to allow high speeds at relatively low power-consumption, these components could be stacked on top of each other to minimise the energy-sapping capacitance of PCB-based buses. The stacking would increase density allowing multiple processor-memory complexes to be mounted within one rack unit.
To carry away and reuse the heat generated by powered internal components, air cooling may be replaced by heat pipes carrying liquid running directly over or underneath the processors and memory. To increase the efficiency of heat transfer, components are likely to be allowed to run hotter than they are now, although some more cautious commentators have claimed that this might affect reliability.
Because of the increasing number of individual processors that will be running within each rack unit, the traditional Ethernet port could become a network router in its own right, providing high-speed optical connections running either multi-gigabit Ethernet or a serial PCI-Express protocol similar to the Thunderbolt I/O interconnect technology.
As multiple processors are brought into close proximity, internal I/O will be able to make increasing use of gigahertz-frequency wireless communication, with point-to-point optical links employed for communications between server units within a rack.
Hard-disk drives could be replaced by flash or similar non-volatile memory for faster access times and lower long-term energy consumption. Drives are currently held at the front for easy upgrades; this is likely to remain the case.