Brain surgery for deep learning
Neural networks are likely to be in charge of cars. What can we do about their raging thirst for compute power?
The scramble to create a self-driving car is underway. The question is what kind of software will control it. The current hot favourite for giving a computer some sense of understanding of what is in front of it on the road is deep learning – a second attempt to take neural networks mainstream that has been made possible by a massive increase in compute power since artificial neural networks first hit the R&D community.
The graphics chipmaker nVidia has done pretty well out of the revolution so far because its GPUs happen to run deep-learning applications faster than conventional processors. However, even these are bumping up against limits. The number of calculations needed to build an effective deep-learning system for any class of image is anywhere between 10^16 and 10^22 multiply-and-add operations – or MACs for short. Even a processor clocked at a gigahertz and completing a few multiplies per cycle is going to need some time to crunch through that lot. The GPU overcomes this to some extent by doing so much in parallel. However, even that has limits: eventually you simply run out of memory bandwidth to let you move data in and out of the array.
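To get a feel for those numbers, here is a back-of-the-envelope sketch in Python. The two machines and their throughput figures are illustrative assumptions, not measurements of any real chip; only the 10^16 and 10^22 totals come from the figures above.

```python
# Rough estimate of how long a training run takes at a given MAC throughput.
# The throughput figures below are illustrative assumptions, not measurements.

SECONDS_PER_DAY = 86_400


def training_days(total_macs: float, macs_per_second: float) -> float:
    """Days needed to perform total_macs at a sustained macs_per_second."""
    return total_macs / macs_per_second / SECONDS_PER_DAY


# Hypothetical machines: a 1 GHz core doing 4 MACs per cycle, and a
# parallel array sustaining 1e12 MACs per second.
machines = {
    "1 GHz core, 4 MACs/cycle": 1e9 * 4,
    "parallel array, 1e12 MACs/s": 1e12,
}

for name, rate in machines.items():
    for total in (1e16, 1e22):
        days = training_days(total, rate)
        print(f"{name}: {total:.0e} MACs -> {days:,.1f} days")
```

Under these assumptions the single core needs about a month for the smallest job, and even the parallel array would take centuries at the top of the range, which is why training ends up spread across many GPUs.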
One piece of good news is that self-driving car computers do not have to do their own training. That can, for the most part, be left to servers in the cloud able to host multiple GPUs. However, even inferencing – the process of running new images through trained networks – chews up memory and compute power. According to Samer Hijazi, senior architect at Cadence Design Systems, that is more like a billion to a trillion MACs. This is achievable but demands pretty power-hungry processors and memory bandwidth that "is not feasible for an embedded device," he adds.
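A rough way to see where a billion MACs per image comes from is to count the work in a single convolutional layer, the building block of most image-recognition networks. The layer shape below is a hypothetical example chosen for illustration, not one taken from Cadence or any particular network.

```python
# Count the multiply-and-add operations (MACs) in one convolutional layer.
# Each output value needs kernel * kernel * in_channels MACs.


def conv_layer_macs(out_h: int, out_w: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """MACs for a square-kernel convolution producing out_h x out_w x out_ch values."""
    outputs = out_h * out_w * out_ch
    macs_per_output = kernel * kernel * in_ch
    return outputs * macs_per_output


# Hypothetical layer: 56x56 output, 256 channels in and out, 3x3 kernel.
macs = conv_layer_macs(out_h=56, out_w=56, in_ch=256, out_ch=256, kernel=3)
print(f"{macs:,} MACs for one layer")
```

One mid-sized layer of this shape already lands close to two billion MACs, so a network with a dozen or more such layers sits comfortably inside the range Hijazi quotes.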
One option is to develop processors that are tuned more to the needs of neural networks than today's GPUs and digital signal processors (DSPs). However, the memory usage remains a significant obstacle. At the Design Automation Conference (DAC) in Austin in June, Professor Mark Horowitz of Stanford University argued the design of many systems is "all about the memory" because, unless something changes, dealing with the bandwidths needed will be prohibitive not just in terms of cost but power.
Work by Horowitz's team and other researchers has indicated that the neural network is a class of application that makes it possible to use extensive caching to reduce the impact of memory accesses. Horowitz argues that, if you started with a blank sheet of paper, you probably would not go for a GPU architecture for applications such as deep learning. What is needed is some sort of programmable pipeline that steers data through a forest of processors so that the execution units do not have to access memory all that much. Much of the energy used by a memory access goes into the charge that has to be fed into the memory's large arrays. Keeping everything local as long as possible helps massively with power consumption.
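The benefit of keeping data local can be illustrated with a simplified counting model. The sketch below is an assumption-laden toy, not Horowitz's analysis: it counts main-memory reads for a square matrix multiply (the arithmetic at the heart of a neural-network layer), first with every operand fetched afresh and then with square tiles loaded once into a hypothetical local store and reused.

```python
# Toy model: count words read from main memory for an n x n matrix multiply.
# It ignores caches, writes and real hardware detail; it only shows the trend.


def naive_loads(n: int) -> int:
    """Every MAC fetches both of its operands from main memory."""
    loads = 0
    for _i in range(n):
        for _j in range(n):
            for _k in range(n):
                loads += 2  # one word of A and one word of B
    return loads


def tiled_loads(n: int, tile: int) -> int:
    """Tiles of A and B are fetched once into local memory, then reused."""
    assert n % tile == 0, "tile size must divide n in this simple model"
    blocks = n // tile
    loads = 0
    for _bi in range(blocks):
        for _bj in range(blocks):
            for _bk in range(blocks):
                loads += 2 * tile * tile  # one tile of A and one tile of B
                # the tile**3 MACs that follow run entirely from local memory
    return loads


n, tile = 64, 8
print(naive_loads(n), tiled_loads(n, tile))
```

In this model the number of main-memory reads falls by a factor equal to the tile size, so the larger the local store, the less often the expensive arrays have to be charged.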
Horowitz and other researchers are proposing approaches such as the coarse-grained reconfigurable architecture (CGRA), which lets you stitch together execution units, local memory and pipeline registers for short-term storage in a way that is tuned for the specific algorithm. It might even be tuned to a specific shape of neural network. Another option that involves less deep-level reworking is to make the processor less important compared to the memory. Cadence and competitors like Ceva have added much smarter memory management units (MMUs) to their parallel processors that pull in data and organise it in local memory for more efficient processing.
The extreme option is to make memory the centre of attention and distribute lots of execution units throughout the array to minimise the distance that data has to travel. Micron Technology has started down that road with the Automata architecture. By turning computer architecture on its head it also turns programming on its head. Automata code bears little resemblance to conventional software, which makes it hard to adopt.
There is another way and it's one that nVidia is working on with chief scientist and Stanford professor Bill Dally. This work prunes the neural network after training to remove connections that make little difference to the results.
Cadence has been doing its own version of pruning, looking at ways to slim down the network during training as well as trimming the network once training has completed. "We created our own network architecture and methodology for analysing the data before we start. Create a network that fits within a continuum of complexity and performance. Train the network down. At a certain point we can no longer slim it down," Hijazi says. By that time the number of calculations has been reduced significantly.
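Pruning after training can be sketched in a few lines. The version below uses the size of each weight as a stand-in for "makes little difference to the results", which is a common simplification and an assumption here; it is not a description of how nVidia or Cadence actually choose which connections to drop. The function name and example weights are hypothetical.

```python
# Magnitude-based pruning: zero out the smallest weights in a layer.
# Using absolute value as the importance measure is an assumption.


def prune_by_magnitude(weights, fraction):
    """Return a copy of weights with the smallest `fraction` set to zero.

    Weights that tie with the threshold are also removed, so slightly more
    than `fraction` can be pruned when values repeat.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


layer = [[0.8, -0.05, 0.3],
         [0.01, -0.6, 0.02]]
pruned = prune_by_magnitude(layer, fraction=0.5)
print(pruned)
```

Every zeroed weight is a MAC that no longer has to be performed and a value that no longer has to be fetched, provided the hardware or software can skip it. In practice a pruned network is usually retrained briefly to recover any lost accuracy.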
Another route neural network designers have taken is to look at mathematical precision. Google's custom chip for its TensorFlow software, the Tensor Processing Unit (TPU), can work with data as narrow as 8-bit and with fixed-point rather than the more flexible floating-point format.
"The consensus at the start was that floating-point arithmetic was the way to go," says Hijazi. "The consensus has largely shifted to fixed-point for inference and even to push down to 8bit. With advanced quantisation techniques we can go below even 8bit representation, using 4bit values and even 4x4bit calculations with minimal degradation in performance. That dramatically reduces bandwidth." By focusing on network optimisations, Hijazi reckons the changes in hardware architecture can be relatively small and not demand a switchover to the in-memory computer. "It was a nice idea but as we look more into this problem, we see we can reduce the bandwidth needed to the point where it becomes feasible."