Learning to design better
Where better to use artificial intelligence (AI) than in designing new machines to do machine learning? But sometimes you just need the hardware and not the AI part.
The amount of money that’s been ploughed into deep learning has convinced chipmakers they really need to put a lot more high-speed arithmetic processing into their devices. Just about every smartphone chipset now has some sort of AI accelerator sitting inside it in much the same way they incorporated graphics processors over a decade ago.
The AI and graphics processors perform similar operations: lots of multiplies and additions. It’s why the early deep-learning experiments homed in on graphics processors as a way to get results in hours rather than days. But they each have their own quirks that lend them to particular applications. Today, servers are running with high-end accelerators from the likes of Nvidia that are nominally graphics processing units (GPUs) but which have evolved into engines that are able to run thousands of floating-point calculations in parallel. The way those calculations can be coupled together has changed subtly.
Graphics operations tend to be like filters, with lots of independent streams, but AI focuses more on multidimensional matrix or tensor manipulations that call for a lot more thought about how data flows through the hardware. By choosing to accelerate generic tensor arithmetic rather than specific AI functions, the hardware designers have opened the door to doing a lot more with those accelerators. And it is feeding back into the design of the hardware itself.
At the VLSI Symposia earlier in June, David Pan, professor in electrical engineering at the University of Texas at Austin, described a system for placing circuit elements automatically that can harness these accelerators without using AI directly. He pointed out the similarities between the linear algebra used in the forward propagation of neural-network training and the calculations used by analytic solvers developed for placement such as RePlAce. That software, developed by Professor Andrew Kahng’s team at the University of California at San Diego, forms part of Darpa’s OpenROAD Project, which is an attempt by the US government agency to build a portfolio of open-source hardware-design tools that could make access to custom silicon easier for smaller companies and even individuals.
“By doing this, we don’t need any training data: we are using the training structures provided by deep learning’s hardware and its software toolkits,” Pan explained. “We run on a GPU and rewrite the code using deep-learning toolkits to get the same quality of results as RePlAce but with a speed-up of 40 times. We can use the same paradigm to solve other EDA problems.”
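The idea of treating placement as an optimisation problem rather than a learning problem can be sketched in a few lines. The example below is a toy, not RePlAce or Pan's code: the cell names, coordinates, learning rate and the quadratic wirelength objective are all assumptions chosen for illustration. It moves two cells by gradient descent, the same kind of update a deep-learning toolkit applies to network weights, with no training data involved.

```python
# Toy analytic placement: minimise quadratic wirelength by gradient descent.
# All names and numbers are hypothetical; this is not RePlAce or Pan's code.

def wirelength(pos, nets):
    """Sum of squared distances between the two pins of each net."""
    return sum((pos[a][0] - pos[b][0]) ** 2 + (pos[a][1] - pos[b][1]) ** 2
               for a, b in nets)

def gradient(pos, nets, movable):
    """Gradient of the wirelength with respect to each movable cell."""
    grad = {c: [0.0, 0.0] for c in movable}
    for a, b in nets:
        for axis in (0, 1):
            d = 2.0 * (pos[a][axis] - pos[b][axis])
            if a in grad:
                grad[a][axis] += d
            if b in grad:
                grad[b][axis] -= d
    return grad

def place(pos, nets, movable, lr=0.1, steps=200):
    """Plain gradient descent; fixed cells (pads) never move."""
    pos = {c: list(p) for c, p in pos.items()}
    for _ in range(steps):
        g = gradient(pos, nets, movable)
        for c in movable:
            pos[c][0] -= lr * g[c][0]
            pos[c][1] -= lr * g[c][1]
    return pos

# Two fixed pads and two movable cells connected in a chain.
start = {"pad_w": (0.0, 0.0), "pad_e": (10.0, 4.0),
         "u1": (9.0, 9.0), "u2": (1.0, 8.0)}
nets = [("pad_w", "u1"), ("u1", "u2"), ("u2", "pad_e")]
final = place(start, nets, movable={"u1", "u2"})
```

In this sketch the two movable cells settle evenly spaced along the line between the pads. A production placer would also have to penalise overlapping cells, which is omitted here, and would run the arithmetic as batched tensor operations on a GPU rather than in Python loops.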
Pan’s group has not avoided machine learning entirely. Another tool designed to show how nanometre-scale circuitry will appear on-chip after lithography uses a generative adversarial network (GAN) to make the predictions. Although GANs and other deep-learning models are computationally intensive, they can still work out faster than direct numerical simulations. This GAN works out about two thousand times faster than is possible with conventional simulation. Although the results are not quite as accurate as a full simulation, Pan said people in the industry considered the speed-accuracy trade-off to be viable. A tool like this can provide quick pointers to parts of the design that will be problematic, with full simulation used to identify the precise problem.
Though the Austin group avoided using machine learning techniques for placement, others reckon it will help. Young-Joon Lee, physical design engineer at Google, talked about work on a system to place larger elements such as memory blocks on chips that will themselves be used for deep learning. “Learning-based methods can gain experience as they solve more instances of the problem.”
The Google team opted for the same kind of reinforcement-learning system used in earlier experiments in playing games and robot control, though with some changes to make it work in electronic design. One major alteration was delaying the point at which the software calculates the 'reward' for the model’s efforts. In AI accelerators, moving data between tensor operations is so critical that the chips need a lot of memory blocks, distributed across their surface. It is impossible to determine how good each placement is until all of the blocks are in place. This meant only calculating the reward at the end, which makes the training process a lot slower. The placement also depends heavily on what the circuitry around these blocks does, so you cannot have a one-size-fits-all model.
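A delayed reward of this kind can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the grid, the macro names, the random stand-in for a learned policy and the use of half-perimeter wirelength as the cost. It is not Google's system. The point is only that every intermediate step returns a reward of zero and the single meaningful score arrives once the last block is placed.

```python
# Sketch of an episode with a delayed (terminal-only) reward.
# The grid, netlist and cost function are hypothetical illustrations.
import random

def half_perimeter(placement, nets):
    """Half-perimeter wirelength: bounding-box width plus height per net."""
    total = 0
    for net in nets:
        xs = [placement[m][0] for m in net]
        ys = [placement[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def run_episode(macros, nets, grid, policy, rng):
    """Place macros one at a time; only the last step carries a reward."""
    free = [(x, y) for x in range(grid) for y in range(grid)]
    placement, rewards = {}, []
    for i, macro in enumerate(macros):
        slot = policy(macro, placement, free, rng)
        free.remove(slot)
        placement[macro] = slot
        done = i == len(macros) - 1
        # The cost is undefined until every macro has a position.
        rewards.append(-half_perimeter(placement, nets) if done else 0)
    return placement, rewards

def random_policy(macro, placement, free, rng):
    """Stand-in for a learned policy: pick any free slot."""
    return rng.choice(free)

macros = ["m0", "m1", "m2", "m3"]
nets = [("m0", "m1"), ("m1", "m2", "m3")]
placement, rewards = run_episode(macros, nets, 4, random_policy,
                                 random.Random(0))
```

Because the learner sees no feedback until the end of each episode, it has to work out which of the earlier moves deserved the credit, which is one reason training of this kind is slow.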
What the Google team came up with was a system that can do an OK placement on circuitry it’s never seen before but which can tune itself quickly to do a better job. After learning the characteristics of the circuitry over a 24-hour period, they found it could obtain results similar in quality to those of a human team that might take six to eight weeks, though they look quite different from how humans place these blocks.
People tend to favour rectilinear and symmetrical layouts. The machine likes curves. However, because the curvy layouts helped cut the length of on-chip wiring, Lee said the human team working on the next generation of Tensor Processing Unit (TPU) hardware at the company adopted similar layouts and were able to improve on timing in doing so.
Though you might expect Google to invest heavily in machine learning for these kinds of tasks, its team, like the Austin group, opted for more conventional techniques for the engine that places smaller circuit elements known as standard cells around the larger memory blocks.
“The reason we did this is because macro placement is more challenging. Standard cells are small and existing analytical approaches produce good results for them,” Lee said.
The tools that put circuitry onto reconfigurable hardware such as field-programmable gate arrays (FPGAs), which are now used heavily in AI processing because of their flexibility, have also turned to machine learning to guide design. Singapore-based Plunify launched a tool several years ago to help drive logic synthesis and layout choices for FPGAs.
Now the FPGA maker Xilinx has joined in with an update to its Vivado development environment. Although Plunify opted for an approach where the compilation tools learn from customer designs, with the option to use a cloud version for those who do not want to train it themselves, Xilinx has kept things a little simpler with ready-to-use models that will be updated periodically. The company is not automatically collecting metadata from customers to train those models but will use designs contributed for the purpose.
Xilinx has chosen to favour the use of machine learning to act as interactive help for users rather than performing the compilation and placement directly. “Right now, it’s acting as a ‘smart guide’ to navigate through many advanced strategies and constraints in Vivado, based on the design’s congestion,” says Nick Ni, director of product marketing at Xilinx.
Congestion is a key issue in any chip design because often there are not enough paths to route traces between logic cells in complex circuits. FPGA design is more prone to congestion as the routing options are more limited than in fully custom silicon.
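A rough way to picture congestion is to compare routing demand with routing capacity tile by tile. The sketch below is a simplification under stated assumptions: the grid size, the per-tile capacity and the rule that each net adds one unit of demand to every tile inside its bounding box are all invented for illustration and do not describe how Vivado measures congestion.

```python
# Rough congestion estimate: routing demand versus capacity per grid tile.
# Tile capacity and the bounding-box demand model are assumptions.

def congestion_map(nets, width, height, capacity):
    """Return demand/capacity per tile; values above 1.0 are over-subscribed."""
    demand = [[0 for _ in range(width)] for _ in range(height)]
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        # Each net may route anywhere inside its bounding box.
        for y in range(min(ys), max(ys) + 1):
            for x in range(min(xs), max(xs) + 1):
                demand[y][x] += 1
    return [[d / capacity for d in row] for row in demand]

def hotspots(cmap):
    """List the (x, y) tiles whose demand exceeds capacity."""
    return [(x, y) for y, row in enumerate(cmap)
            for x, ratio in enumerate(row) if ratio > 1.0]

# Three nets, each given as a list of (x, y) pin locations.
nets = [[(0, 0), (2, 1)], [(1, 0), (2, 2)], [(1, 1), (3, 1)]]
cmap = congestion_map(nets, width=4, height=3, capacity=2)
crowded = hotspots(cmap)
```

Lowering `capacity` in this toy model, as a loose analogue of an FPGA's more limited routing options, turns more tiles into hotspots for the same set of nets.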
“We also have more machine learning features in synthesis,” Ni adds. These help with estimating how many on-chip resources will be needed for a given design. There is often a trade-off in FPGA between using hardwired elements such as arithmetic units and taking up more chip space by assembling similar circuits from programmable logic. Choices over how those resources are allocated can make a big difference to how well a design will fit and at what speed.