Greedy tech gives resource problems
Image credit: Graphcore
AI loves computers and the more of them the better. But resources – silicon and energy – are finite, so the dash for growth is going to have to end soon.
No one seems to have told the AI community about the silicon shortage that has caused memory and graphics card prices to shoot up while car manufacturers struggle to find supplies. Because, based on current trends, there is nothing like a neural network for chewing up silicon.
Take Cerebras Systems as an example. The company is now on its second generation of AI processor, using a design that consumes more or less a full wafer of silicon. In this second generation, the processor relies on a separate external unit to feed it the data it needs. Cerebras today lies at the extreme end of the silicon-area scale, but many of the start-ups and systems companies making accelerators for AI have taken the view that they need to make them as big as they can.
Simon Knowles, chief technology officer at Bristol-based Graphcore, explained at the Hot Chips 33 conference in late August that his company’s mark-two processor is “as big as a reticle allows”, alluding to the maximum area the lithographic equipment used to print features on a chip can illuminate at once. To get to its wafer-size product, Cerebras worked with foundry TSMC to develop a way to connect together the circuits lying inside each reticle area.
For many applications, cost is a major factor in determining the viability of a chip, at least if it is going to ship in volume. And that cost has a lot to do with die area – so much so that for consumer-level products such as smartphones, chipmakers tend to be keen to keep die area below a square centimetre.
Over at least four generations of product, Apple’s A series of processors have kept to just below that magic number – and it is one that has not changed much in several decades. The result? You could get 250 Apple A14 processors out of the wafer area consumed by Cerebras’ core array.
In AI, suppliers are banking on their much more expensive designs working out cheaper at a system level than the arrays of GPUs and server blades they use today. It is a response to what has become accepted as inevitable in machine-learning circles. Carole-Jean Wu of Facebook says: “There are three industry trends that have fuelled deep learning: open algorithms, bigger and better data, and compute power.”
The silicon arms race got under way a decade ago after a team at the Dalle Molle Institute for Artificial Intelligence Research (IDSIA) in Switzerland experimented with that combination. Finding the performance of regular server processors too limiting, they turned to the parallel processing power supplied by graphics processing units (GPUs). Developed to render geometric meshes into realistic 3D scenes, they happened to support the kind of floating-point arithmetic the training of deep learning needs.
The work quickly pushed deep neural networks (DNNs) to the point where they routinely scored better than humans on narrowly defined tests, such as the ability to recognise road signs. In IDSIA’s work, the DNN could decode the meaning of a sign that had been almost completely bleached by the sun.
Image-processing DNNs themselves are today small fry. According to Linley Gwennap, president of analyst firm The Linley Group, the biggest models used for image recognition have more or less doubled in size every year since the early 2010s. But the development of a neural structure known as a Transformer has caused DNNs’ growth to go into overdrive.
At the analyst’s spring conference last year, Gwennap plotted a line showing ten-fold growth in neural capacity per year. By the time of the spring conference this year that line had moved up to 40-fold, driven by the unveiling of OpenAI’s GPT-3 followed by the Google Switch, with a maximum of 1.6 trillion trainable parameters, though these are broken down into multiple smaller models.
It’s a similar story for the Chinese machine Wu Dao unveiled in June. Also ten times bigger than GPT-3’s 175 billion parameters, it uses a mixture-of-experts configuration where a supervising neural network acts as a judge of which output is likely to be right.
Though they were developed primarily to handle text, Transformers have turned out to be surprisingly general-purpose and are now being used to replace the simpler convolutional layers in more conventional DNNs used to look at images. IBM Research has deployed them in a model that predicts plausible ways to synthesise novel chemicals.
The seeming generality of Transformer-based models encouraged Stanford University to set up a research centre dedicated to what have become known as foundation models. The name came in for stinging criticism from other researchers, who argued that, flexible as these models are, they should not be regarded as the foundation of a new wave of AI. Nevertheless, in its report describing why the centre is needed, the Stanford team identified as a key issue with these behemoths the amount of energy and resources they need to be trained and to run.
In 2019, Emma Strubell and colleagues at the University of Massachusetts at Amherst estimated that identifying and training a Transformer-based DNN of the largest size Google had published at that point consumed almost as much energy as the lifetime usage of five petrol-driven cars.
Google’s BERT-Large is more than a hundred times smaller than GPT-3 and, according to OpenAI, needed 75 times fewer compute cycles to train. OpenAI did not publish how many GPUs it used in parallel but it would most likely have taken a month of non-stop use across 1,024 Nvidia A100 cards to complete the network’s initial training.
There are performance reasons for going big. Facebook AI research director Laurens van der Maaten explained at a joint IBM-IEEE seminar last year that it is not just capacity that is going up. More training time works as well. “Every time you double the number of examples, the increase is larger if you have more parameters. You can see it in our study as well as in the GPT-3 study.”
In principle, you could keep going. Though Transformers don’t think like us, you could continue the scaling and try to overcome the problem of DNNs making mistakes when presented with something unexpected by giving them access to all the data they might ever encounter and enough space to store the learned parameters. GPT-3 needed to analyse some 45TB of internet documents to construct its language model.
At Hot Chips 33, Cerebras president Sean Lie claimed: “We’re outpacing Moore’s Law by an order of magnitude. At this rate we’ll soon need a football field’s worth of silicon just to run one model.”
Something has to give, and the process may have started. The growth line for single, trained models has settled back to a little under 10x per year following the launch a few weeks ago of the Microsoft and Nvidia Megatron-Turing ‘natural language generation’ model. The model’s builders said they configured it to have 530 billion trainable parameters: about three times the size of GPT-3.
For some problems, bigger Transformers may not help. At the IBM Research conference on AI in October, senior researcher Teodoro Laino said there is an accuracy trade-off with the size of Transformers but that unless the molecules being described are huge, this does top out. “In the past few years, we’ve been able to understand that the size of the Transformers is not creating too much of a difference,” he noted.
For tasks where size still matters, the attempts to rein in DNN energy take three forms. The one that may prove the most fruitful is today far from the most popular, even though it could make a major contribution to the quest for artificial general intelligence (AGI). The most obvious response, thanks to the billions of venture-capital funding ploughed in, is hardware acceleration of the kind being pursued by Cerebras, Graphcore and others. These accelerators can deliver many more floating-point calculations per watt than GPUs can manage, which should make them the popular choice for big models.
Hardware acceleration has its own issues. Very often, the hardware is optimised for certain types of workload. Early implementations, such as Google’s own Tensor Processing Unit (TPU) arrays, were optimised to handle the layers found in image-oriented convolutional DNNs. Though Transformers are spreading rapidly through the machine-learning community, there is no guarantee that researchers will not come up with new structures that need different approaches to acceleration. Sparsity is one such issue, and an area where GPUs may start to fall behind quickly.
One thing today’s computer architectures like is a regular, dense data structure, such as a pair of matrices filled with numbers. This lets you design a pipeline that can run four, eight or sixteen multiplications in parallel using just one instruction. The shift in DNN design is towards sparsity, where the matrices used to calculate how much each neuron contributes to the answer contain zeroes, possibly lots of them. Hardware can take advantage of those zeroes by not running what would be pointless calculations. But if the layout of the zeroes is hard to predict, or you guess wrong about the patterns real-world DNNs will use, you can still lose vital throughput when transferring data to and from memory, even though the overall number of operations falls. If, however, you can introduce regular sparsity, skipping zeroes in a predictable pattern, it is possible to get GPUs to run the code efficiently, though this can hurt accuracy.
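The pay-off from skipping zeroes can be seen in a toy sketch (this is illustrative only, not any vendor's implementation): a sparse matrix-vector product that stores and multiplies only the non-zero weights, versus a dense product that multiplies every entry regardless.

```python
def dense_matvec(matrix, vector):
    """Dense pipeline: multiply every weight, zero or not."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def to_csr(matrix):
    """Keep only the non-zero weights and their column indices
    (the idea behind compressed sparse row storage)."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in matrix]

def sparse_matvec(csr_rows, vector):
    """Only the stored non-zeros contribute; zeroes cost nothing."""
    return [sum(w * vector[j] for j, w in row) for row in csr_rows]

# A weight matrix that is 75 per cent zeroes: the sparse version
# performs 3 multiplications instead of 12, for the same answer.
weights = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.5],
    [3.0, 0.0, 0.0, 0.0],
]
x = [1.0, 1.0, 1.0, 1.0]
assert dense_matvec(weights, x) == sparse_matvec(to_csr(weights), x)
```

The catch described above is visible here too: the sparse version's memory accesses (`vector[j]`) are irregular, which is exactly what costs throughput on hardware built for dense, predictable access patterns.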
Lie says the Cerebras processor is designed around exploiting sparsity, to the extent that a separate computer acts as a co-processor to reorder data so that it can be fed in parallel at high speed through the wafer-scale processor array. Similarly, Graphcore implemented addressing modes for sparsity in its processor designs, though Knowles says there was little demand for it in the first version of the silicon.
A continuing issue for dedicated accelerators is flexibility. “The problem for domain-specific hardware is how to allow applications to scale,” says Gunnar Hellekson, general manager for enterprise Linux at Red Hat. The customer needs to be sure more machines of the same type are available. For GPUs, the chances are that hardware will always be available because it is being shared with users who have differing needs.
The second option for energy reduction is to get the software that builds and trains DNNs to find shortcuts. To avoid the training-time explosion faced by OpenAI with its GPT models, Microsoft has been working on a training library called DeepSpeed that culls a lot of the redundant operations used in distributed training as well as reducing the amount of GPU memory that is needed.
Without the optimisations, Samyam Rajbhandari, researcher at Microsoft, says GPT-3 needs a minimum of 256 GPUs just to store the parameters conventional training programs need. Part of DeepSpeed pushes parameters and data that are used less often out into flash memory in a way that avoids pushing up overall energy consumption. “With this, we could fit a one-trillion-parameter network on a single GPU,” he claims. However, without other changes, you might have to wait a while for the job to finish.
The other direction for software libraries is to find operations that can be removed without affecting the overall quality of results. One such approach developed for Transformers is to drop out the processing of parts of a layer or even full layers. Even random dropping can work. Another technique, which is used by DeepSpeed, is to drop layers progressively if the software estimates it is reasonably safe.
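The layer-dropping idea can be sketched in a few lines. This is a hypothetical outline of progressive layer dropping, not DeepSpeed's actual API: each layer is skipped with a probability that rises as training progresses and is higher for layers deeper in the stack.

```python
import random

def keep_probability(layer_index, num_layers, progress, min_keep=0.5):
    """Early in training (progress near 0) every layer runs; later,
    deeper layers are dropped more aggressively. All parameter names
    and the linear schedule here are illustrative assumptions."""
    depth_scale = (layer_index + 1) / num_layers
    return 1.0 - progress * (1.0 - min_keep) * depth_scale

def forward_with_layer_drop(layers, x, progress, rng=random):
    """Run a stack of layers, randomly skipping some of them."""
    n = len(layers)
    for i, layer in enumerate(layers):
        if rng.random() < keep_probability(i, n, progress):
            x = layer(x)   # run the layer
        # else: skip it - in a Transformer the residual connection
        # carries x through unchanged, so the network stays valid
    return x
```

At `progress=0.0` every layer always runs; by the end of training the deepest layer is executed only half the time, cutting compute without (the hope is) hurting the final model much.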
An even more aggressive reduction comes from experience with optimising DNNs for inference, especially for mobile and embedded computers. For some years, designers have increased the sparsity of the layers where they can through a pruning process that looks for weights that can be set to zero without damaging performance too much. This can cut the number of calculations ten-fold and in some cases has even increased model accuracy. But the results are highly variable.
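Magnitude pruning, the most common form of the process just described, can be sketched as follows. This is a minimal illustration, not a production pruning routine: it zeroes out a chosen fraction of weights, smallest absolute values first.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights, smallest magnitudes
    first. Ties at the threshold are pruned too, so the achieved
    sparsity can slightly exceed the request - fine for a sketch."""
    flat = sorted(abs(w) for row in weights for w in row)
    cutoff_index = int(len(flat) * sparsity)
    threshold = flat[cutoff_index - 1] if cutoff_index > 0 else -1.0
    return [[w if abs(w) > threshold else 0.0 for w in row] for row in weights]

# Prune half of a small weight matrix: the three smallest-magnitude
# weights are set to zero, the rest survive untouched.
pruned = prune_by_magnitude([[0.9, -0.05, 0.4], [0.01, -0.8, 0.1]], 0.5)
```

In practice the pruned network is then fine-tuned for a few more steps to recover any lost accuracy, which is where the highly variable results mentioned above come from.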
What would greatly help is if training was not only able to take pruning for inference into account but simply avoid calculating weights that will wind up being thrown away in the end. Researchers from the Massachusetts Institute of Technology likened this ability to winning the lottery and came up with a way of identifying networks just 20 per cent the size of the original that could be carved out for full training, though it involved a little luck with the random numbers used to kick off the process. Google researchers followed up last year with their “rigged lottery” or RigL: a technique that does not depend on a degree of luck.
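The lottery-ticket procedure can be summarised in outline. This is a paraphrase of the recipe with a toy stand-in for training, not the MIT team's code: train the full network, prune the smallest trained weights, then rewind the survivors to their original random initialisation, yielding a much smaller subnetwork to train for real.

```python
def find_winning_ticket(init_weights, train, prune_fraction):
    """init_weights: flat list of initial values; train: any function
    mapping weights to trained weights (a placeholder here for a full
    training run). Returns the rewound subnetwork and its mask."""
    trained = train(list(init_weights))               # 1. train fully
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    dropped = set(order[:int(len(trained) * prune_fraction)])  # 2. prune
    mask = [0.0 if i in dropped else 1.0 for i in range(len(trained))]
    # 3. rewind: survivors go back to their ORIGINAL init values
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return ticket, mask
```

The "luck" the researchers described lives in `init_weights`: only some random initialisations contain a subnetwork that trains well on its own, which is what RigL's rigged lottery set out to remove.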
If you can drop so much from the structure, does that not point to structural inefficiencies in the architecture of DNNs? Broadly, yes. The trouble is that no-one yet knows how the empirical evidence obtained by randomising parameters or dropping them entirely translates in a systematic way into smaller, faster, leaner AI. As with experiments in pruning, various teams have found it is possible to give Transformer-based language models such as Google’s BERT quite severe lobotomies and still have them process text almost as well as the full versions. Research into the unexpected properties of Transformers has gone from scattered work under the banner of ‘BERTology’ to providing the rationale for Stanford to set up its centre for studying foundation models.
For neural network training, the next frontier may lie in rethinking the backpropagation algorithm that gave deep learning its new life in the 2010s. One major puzzle for AI specialists is why it has proven so successful at producing results given that neuroscientists are more or less certain it does not appear in nature: neurons just are not connected in a way that makes it possible. Work on alternative architectures such as spiking neural networks as well as hybrids that break up backpropagation into more manageable chunks may deliver more responsive and efficient neural networks. However, backpropagation is so well entrenched that it may take some time to develop an algorithm that can deliver equivalent behaviour.
At the TinyML Europe conference, Giacomo Indiveri, a researcher at the University of Zurich, pointed to work being done on neuro-inspired architectures that use analogue electronics and novel memory technologies to bring the energy consumption of AI closer to that of biology. “The difference in energy is like comparing a swallow to a 737-jet engine. It will take more understanding to work out what to do.”
The models created in a machine-learning system are represented as computational graphs. Graphcore’s software team mapped the BERT-base natural language processing model to the Graphcore Intelligence Processing Unit (IPU), a graph processor, to create a visualisation that bears a striking resemblance to the human brain (see lead image above). The computational graph is made up of vertices (‘neurons’) for the compute elements, connected by edges (‘synapses’), which describe the communication paths between vertices.
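A computational graph in this sense can be shown with a toy evaluator (an illustration of the general concept, not Graphcore's representation): vertices are operations, edges name which earlier results feed each one, and evaluation walks the graph from inputs to outputs.

```python
def evaluate(graph, inputs):
    """graph: {vertex: (op, [predecessor vertices])}. Leaf names are
    looked up in `inputs`; each vertex is computed at most once."""
    values = dict(inputs)

    def visit(v):
        if v not in values:
            op, preds = graph[v]
            values[v] = op(*[visit(p) for p in preds])  # recurse on edges
        return values[v]

    for v in graph:
        visit(v)
    return values

# A two-vertex graph computing y = (a + b) * a
graph = {
    "sum":  (lambda a, b: a + b, ["a", "b"]),
    "prod": (lambda s, a: s * a, ["sum", "a"]),
}
result = evaluate(graph, {"a": 3.0, "b": 2.0})
```

A framework's job is then to schedule these vertices across compute elements so that connected vertices can exchange data cheaply, which is the mapping problem the IPU visualisation depicts.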