
This time the network is the computer
Nvidia is not shouting as loudly about it, but a lot of its work is now focused on the plumbing.
At its autumn AI conference for developers (GTC), chipmaker Nvidia was still entertaining the idea that it would, eventually, acquire processor-designer Arm. At the spring event, that plan is barely even in the rear-view mirror.
The Cambridge-1 computer that the company said it would build in the UK as a collaboration between Nvidia and Arm is going ahead, but it now looks to be a prototype for a much larger and very much Nvidia-focused machine, one the company plans to use as the blueprint for what founder and CEO Jensen Huang calls an “AI factory”.
Though the focus of Huang’s keynote at the spring GTC this week (21 March 2022) was the replacement for the Ampere graphics processing unit (GPU) architecture, which is now being moved out to provide AI acceleration for robots and other embedded systems, that new architecture sits at the top of a larger strategy. You can read this event in two ways. One is that it reinforces the idea that Nvidia is now a computer company that happens to ship chips to other computer companies, but one that has taken on the design of the overall machine far more explicitly. Alternatively, you could see it as a shift away from traditional approaches to computer architecture, to the point of treating Arm’s role as a sideshow: who needs Arm? The real action is nowhere near the processor.
Even for the Grace multichip module, named after computing pioneer Grace Hopper, the main attraction is not so much the central processing unit (CPU) and Hopper H100 GPU that can go into it when it ships, possibly more than a year from today: it is how they are connected together. In the meantime, the company expects to ship a lot of the Hopper H100 GPUs that will go into some of the Grace packages, and a number of them will end up in the Eos machine. Eos will use a similar architecture to Cambridge-1: a modular design based on DGX pods that assemble multiple PCIe cards and network switches into a standard data-centre rack.
“We expect Eos to be the fastest AI computer in the world,” Huang said, adding that it would be used as a demonstrator and a reference design for customers to transfer into their own data centres. “We are standing up Eos now and will be online in a few months.
“Hopper is going to be a game-changer for mainstream systems as well,” he added, pointing to the way that the GPUs will handle interconnections differently.
For systems that need to distribute AI work across more than one GPU, PCI Express, despite its impressive on-paper speeds even at its fifth generation, has become a bottleneck in Nvidia’s view. The problem is that in most of today’s systems it is really just a staging post for data moving onto an Ethernet network. Like the vendors who use FPGAs to perform AI acceleration, Hopper does away with the need to use PCI Express for those transfers. Instead, the GPUs talk directly to the network, and the Ethernet controller in turn can transfer data in and out of GPU-owned memory without relying on a processor to laboriously copy it.
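To make the contrast concrete, here is a minimal sketch of the conventional path the article calls a “staging post”: the CPU copies a buffer out of GPU memory over PCI Express and only then hands it to the network stack. The CuPy array and plain TCP socket below are illustrative assumptions, not Nvidia’s API; in a direct, GPUDirect-style design the network controller reads GPU memory itself and the host-side copy disappears.

```python
# Illustrative only: the "staging post" data path, where the CPU copies a GPU
# buffer into host memory before the NIC ever sees it. CuPy and a TCP socket
# are used purely as stand-ins for GPU memory and the network stack.
import socket
import cupy as cp  # assumption: CuPy is available for the GPU-resident buffer


def send_gradients_staged(sock: socket.socket, grads_gpu: cp.ndarray) -> None:
    # Step 1: device-to-host copy over PCI Express, driven by the CPU.
    grads_host = cp.asnumpy(grads_gpu)
    # Step 2: the CPU pushes the bytes through the kernel network stack to the NIC.
    sock.sendall(grads_host.tobytes())
    # In the direct path the article describes, the Ethernet controller DMAs
    # straight out of GPU-owned memory and both steps above go away.
```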
In a later breakout session focused on the Hopper GPU itself, principal GPU architect Michael Andersch said the company’s engineers realised there needed to be “a fundamental shift in how we build our machines. We needed to innovate not just inside the GPU but across the data centre.”
This is not a new observation. When HP Enterprise developed the concept for its Moonshot server, it was clear that a lot of the work needed would be in how the different CPU, accelerator and memory modules were wired up to each other, and that it was becoming important to stop data moving around the machine unnecessarily. Cutting that movement was good not just for performance but for energy consumption as well. Work at Stanford University by Professor Bill Dally, who is now chief scientist at Nvidia, showed that most of the energy that goes into a typical computation in today’s architectures is spent moving data in and out of memory.
In several important AI applications, where the model is far too big to fit onto a single CPU-and-GPU combination, data needs to be copied to all of the cooperating machines and the updates shared between them: an all-to-all topology. In others, the data is shared but progressively reduced so that one machine produces the final update at the end of the process, which is then copied back to all the machines to incorporate into their own slice of the model. Because the work has to be done on CPUs or GPUs in most of today’s architectures, this involves a lot of traffic. What Nvidia plans to do with the Hopper generation and the DGX pods that incorporate the 80-billion-transistor chip is to move some of that processing into the network itself.
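In code, that reduction pattern is the familiar all-reduce collective used for data-parallel training. The sketch below is a hedged illustration, assuming PyTorch with the NCCL backend and a launcher such as torchrun setting up ranks and the process group; each worker sums its gradients with every other worker’s and averages the result. This is the collective whose traffic Nvidia wants to push into the network.

```python
# Minimal data-parallel all-reduce sketch. Assumes a launcher (e.g. torchrun)
# has set RANK/WORLD_SIZE/MASTER_ADDR and that the NCCL backend is available.
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Sum each gradient tensor across all workers, then divide by world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Every worker contributes its local gradient; after the call each
            # worker holds the global sum.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # Turn the sum into a mean so the update matches single-node training.
            param.grad /= world_size


# Typical use, after loss.backward() and before optimizer.step():
# dist.init_process_group("nccl")
# average_gradients(model)
```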
That in-network processing comes as the result of the 2019 acquisition of Israeli-American specialist Mellanox. The company developed silicon for network switches that include their own processors to manipulate data packets on their way through from one port to another. This made its way into the Bluefield devices, which along with devices from suppliers such as Fungible represent another category of processor to sit alongside CPUs and GPUs. This is the data processing unit (DPU), though network processing unit might have been a clearer term.
For the all-reduce algorithm, the DPUs in the network cards and switches run a protocol Mellanox devised, the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), originally developed for InfiniBand switches. This lets computers attached to the switches ask the network itself to take care of some of their processing, which includes those data-reduction operations. The result is that much of the data never has to go all the way to all the other GPU cards. A single engine running SHARP collects what it needs and broadcasts the answer to all the machines that asked for the result. Similar multicasting support means processors do not have to explicitly send shared data to each and every machine on a list; they just ask the switch to do the job.
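From the application’s side, the offload is meant to be transparent: the same collective call is issued whether the reduction runs on the GPUs or inside SHARP-capable switches. The sketch below is again a hedged illustration assuming PyTorch with NCCL on a SHARP-capable fabric; the NCCL_COLLNET_ENABLE environment variable is how NCCL’s collective-offload path is commonly switched on, but treat that knob, and whether a given fabric honours it, as assumptions.

```python
# Hedged sketch: the application issues the same all_reduce either way; whether
# the sum is computed on the GPUs or aggregated in SHARP-capable switches depends
# on the fabric and the NCCL/HPC-X configuration. Assumes launch via torchrun.
import os
import torch
import torch.distributed as dist

# Assumption: opt in to NCCL's collective-offload ("CollNet"/SHARP) path.
os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grads = torch.ones(1 << 20, device="cuda")
# Identical call with or without offload; with SHARP the partial sums are
# combined in the switch and only the final result is returned to each GPU.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
```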
Some 30 years ago, Sun Microsystems came up with the slogan “the network is the computer”. At the time, it was more of a dig at IBM where, for many enterprise systems, you had one big computer serving a bunch of dumb terminals. Sun’s proposal was based on the client-server model, which put a lot more emphasis on doing as much work as possible on the remote machines, though it did to some extent ignore the way that many mainframes had their own I/O processors to prevent the core CPU from having to do everything. However, even in Sun’s conception, the network itself was still pretty dumb. This time around, the network really is, if not the computer itself, an integral part of it.