Mozart processor on carrier board

Mozart conducts AI to keep the bytes flowing

Image credit: SimpleMachines

AI has a bloating problem. Some researchers think it can be fixed by rearranging the way computer processors fit together.

At Hot Chips last week, the annual technology symposium held in Silicon Valley, researchers from the University of Wisconsin-Madison and startup SimpleMachines described what they see as a necessary change to the way computing hardware is put together. As with many things in high-end computing, the apparent driver for this change is familiar: artificial intelligence.

Karu Sankaralingam, computer-sciences professor at the University of Wisconsin-Madison, argued in his talk on the Mozart architecture that AI models are getting bloated. Look at language models, and model growth is now outpacing Moore’s Law by a factor of ten. In that environment, it’s no surprise to find wafer-scale processors like those from Cerebras turning up at the same conference. There is one reason why what Stanford University’s HAI calls foundation models are so big: they work better than smaller models. Right now, there is no strong incentive to optimise them for efficiency, although it is pretty clear there is a lot of redundancy in the neural connections. Some experiments on the now quite small BERT architecture showed that you can cheerfully saw huge chunks out of a trained model practically at random and it will still deliver acceptable results.

However, models are growing for another reason, says Sankaralingam: the architectures being run today happen to execute efficiently on the graphics processing units (GPUs) still mainly used to train and execute them. Other architectures could deliver good results but use far fewer teraflops, he argues. The downside? They are not a great fit for GPUs or the other types of accelerator used for these applications. There are optimisations that greatly reduce the number of calculations even within the convolutional layers that kickstarted the deep-learning revolution. One technique, known as depthwise convolution, breaks the matrix multiplications down into smaller chunks. It gets used a fair amount for inference in embedded systems because it can deliver the same results with up to ten times fewer operations. Unfortunately, according to the UW-Madison researchers, it does not run any faster on GPUs than normal convolutions do, because the overhead of getting data on and off the chip dominates the performance equation.
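To put a rough number on that saving, the sketch below counts multiply-accumulate operations for one layer. It assumes the common depthwise-separable form (a per-channel filter followed by a 1x1 pointwise convolution), which the talk may or may not have used, and the layer dimensions are hypothetical values chosen purely for illustration.

```python
def standard_conv_macs(height, width, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution (stride 1, same padding)."""
    return height * width * c_in * c_out * k * k


def depthwise_separable_macs(height, width, c_in, c_out, k):
    """Multiply-accumulates for a depthwise k x k filter plus a 1x1 pointwise convolution."""
    depthwise = height * width * c_in * k * k   # one k x k filter per input channel
    pointwise = height * width * c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer shape, not taken from the talk.
H, W, C_IN, C_OUT, K = 56, 56, 128, 256, 3

standard = standard_conv_macs(H, W, C_IN, C_OUT, K)
separable = depthwise_separable_macs(H, W, C_IN, C_OUT, K)
ratio = standard / separable

print(f"standard:  {standard:,} MACs")
print(f"separable: {separable:,} MACs")
print(f"reduction: {ratio:.1f}x")
```

For a 3x3 kernel the reduction approaches nine as the number of output channels grows, which is in line with the "up to ten times" figure. The count says nothing about memory traffic, though, and that is the part the researchers say dominates on a GPU.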

Having to shuffle data around is the great obstacle to efficient machine learning. A series of papers has underlined how much energy it takes to look up data in memory, fetch it and replace it. Actually running a multiplication, even at high speed, is far less power hungry. Moving data also takes time. Researchers have talked about the memory wall for decades and, if anything, the wall has grown taller since it first appeared. The problem is that computing largely boils down to the following: decide where to get the data, decide what to do with it, and decide where to put the results when that’s been done.

Traditionally, all these functions were combined in the dominant processor architecture: the von Neumann machine. But it was designed for an earlier age, when transistors were precious and memory was at least as fast as the logic gates in the instruction pipeline. As a result, it made a lot of sense to run several instructions to grab data from memory and dump it into local registers, do something with it and then put the results in place with another instruction.

Now, the compute engines are often starving for fresh data. The roadblock is getting the data in and out. This is where UW-Madison’s Mozart comes in. It breaks the computer down to match those three phases. Two others complete the set of operations needed for a full computer processor: synchronisation, to prevent threads corrupting each other’s data stores, and control, such as branching. In fact, the control is handled by a regular microprocessor, as von Neumann machines are perfectly good at that kind of thing, as billions of embedded microcontrollers have demonstrated over the years.
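The three data phases can be pictured as separate stages connected by streams rather than as instructions interleaved in one pipeline. The following is a minimal software sketch of that idea only; the `ReadStream` and `WriteStream` descriptors and the function names are hypothetical and are not taken from the Mozart design.

```python
from dataclasses import dataclass


@dataclass
class ReadStream:
    """Hypothetical descriptor for phase one: where to get the data."""
    base: int
    stride: int
    length: int


@dataclass
class WriteStream:
    """Hypothetical descriptor for phase three: where to put the results."""
    base: int
    stride: int


def fetch(memory, stream):
    """Phase one: walk the address pattern and yield operands in order."""
    for i in range(stream.length):
        yield memory[stream.base + i * stream.stride]


def compute(left, right, op):
    """Phase two: apply the operation to operands as they arrive."""
    for x, y in zip(left, right):
        yield op(x, y)


def write_back(memory, stream, results):
    """Phase three: store each result at the next address in the pattern."""
    for i, value in enumerate(results):
        memory[stream.base + i * stream.stride] = value


# Eight input values followed by four empty slots for the results.
memory = list(range(8)) + [0] * 4

evens = ReadStream(base=0, stride=2, length=4)  # reads 0, 2, 4, 6
odds = ReadStream(base=1, stride=2, length=4)   # reads 1, 3, 5, 7
out = WriteStream(base=8, stride=1)

products = compute(fetch(memory, evens), fetch(memory, odds), lambda x, y: x * y)
write_back(memory, out, products)

print(memory[8:])  # [0, 6, 20, 42]
```

The compute stage never sees an address. All it gets is a sequence of operands, which is the property that lets the data-fetching side run ahead of the arithmetic.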

Mozart looks sensible, and at this point you might wonder, "Why aren’t computers designed like this already?" In fact, some already are. This one is an attempt to formalise something that has already evolved in machine-learning and signal-processing circles. The core computing area is a coarse-grained reconfigurable architecture (CGRA): basically a bunch of execution units that feed data to each other using programmable interconnect. For decades, in applications such as radar processing, field-programmable gate arrays (FPGAs) made by Intel PSG and Xilinx have been doing that kind of job. Google’s Tensor Processing Unit (TPU) ASIC is based on a systolic array that has a fairly fixed forwarding network.

One weakness that the FPGA and hardwired systolic-array architectures share is that, although they work brilliantly on dense matrices, utilisation drops like a stone when they are faced with sparse structures, which GPUs also find problematic. This is where the Mozart data-gathering engine comes in. It looks ahead into the sequence of operations, reads the data and reorganises it into a pattern that will flow nicely through the CGRA. This is not a radical departure either. Signal-processing specialists such as Ceva have used scatter-gather memory controllers to feed their own highly parallelised execution pipelines, though these do not yet employ a CGRA structure. A major downside of the CGRA is that it is not all that hardware efficient: programmable interconnect tends to be expensive. There is one other obstacle for the FPGA-like architectures: they are tricky to program. Although Xilinx has worked hard on making the tools more programmer-friendly, the environment is still quite hardware-centric.
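The effect of gathering on utilisation can be shown with a toy example. The sketch below is an illustration only, using made-up data: it compares how many multiplier slots do useful work when a sparse row is pushed through as-is against the same row after its non-zero entries have been gathered into a compact stream. It is not a model of how the Mozart engine is implemented.

```python
def dense_utilisation(row):
    """Fraction of multiplier slots doing useful work if every element is issued."""
    useful = sum(1 for weight in row if weight != 0)
    return useful / len(row)


def gather_nonzeros(row, vector):
    """Gather step: pair each non-zero weight with the operand it needs."""
    weights, operands = [], []
    for index, weight in enumerate(row):
        if weight != 0:
            weights.append(weight)
            operands.append(vector[index])
    return weights, operands


def dot(weights, operands):
    """Compute step: a plain multiply-accumulate over whatever stream it is given."""
    return sum(w * x for w, x in zip(weights, operands))


# Made-up sparse row and input vector.
row = [0, 0, 3, 0, 0, 0, 5, 0]
vector = [1, 2, 3, 4, 5, 6, 7, 8]

weights, operands = gather_nonzeros(row, vector)

print(dense_utilisation(row))   # 0.25: six of the eight multiplies are wasted
print(weights, operands)        # [3, 5] [3, 7]
print(dot(weights, operands))   # 44, the same answer as dot(row, vector)
```

The compute function is unchanged in both cases. The work of dealing with sparsity has moved into the gather step, where the irregular memory accesses can be handled separately from the arithmetic.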

This is where the Mozart team believe they may have an advantage: they have developed a software stack that works directly on the model source code and generates not so much a file full of instructions as a list of streams that map more neatly onto the data-gathering engines. “Program synthesis and the auto-generation of the software stack is essential for future chips,” says Sankaralingam. “The compiler looks at the semantics of the program and breaks it up into the four broad classes of activity.”

The theory is that this approach will make it easier to take new models written with readily accessible languages and libraries and generate efficient programs for something like Mozart. That would avoid the hand-tuning that tends to push AI developers towards tried-and-tested, ready-made kernels. In principle, the decomposition process used by the compiler could map onto more conventional architectures, but the team believes its hardware is more amenable to this approach than most.

Although it performs worse than an Nvidia A100 on one of the common DNN structures, the 16nm Mozart implementation looks to hold its own across a variety of model types against the commercial device, which is made on a process that is a couple of generations more advanced. A forthcoming implementation aimed at a 7nm process should deliver higher performance.

In practice, many of the pieces that make up Mozart are already in use today. But a reformulation of the computer’s core attributes along the lines proposed by the UW-Madison team, coupled with the continued drive for better AI, may finally break up the von Neumann machine.
