[Image: bug sitting on a PCB graphic]

Bug zappers: hunting down the chip killers

It all seems to work. And then the software starts running.

Twenty years ago electronics design faced a looming problem. The number of transistors that could be laid out on each chip was spiralling up towards the tens of millions. But hardware design lagged behind – engineers were not turning out circuits quickly enough. Extrapolating the numbers to the mid-2010s, it looked as though the world’s hardware designers could team up and maybe just about accumulate enough person-months to produce a single billion-transistor chip design.

Yet the design gap vanished almost as quickly as it appeared. In much the same way that designers at the board level moved from wiring logic functions together to using high-integration chips, the chip designers themselves started to buy in intellectual property (IP) offered by specialist companies such as ARM.

Having passed the billion-transistor mark, the IP-centric model has itself become a source of problems when it comes to checking that the final design works. The scale of the problem challenges traditional verification methods that were designed for smaller projects. “What are now low-level clusters within a system-on-chip [SoC] would have been entire SoCs in previous years,” Nick Heaton, Cadence systems and SoC architect, said at the Verification Futures seminar in February.

Michal Siwinski, vice president of product management in Cadence’s system and verification group, also notes this changing environment: “Traditionally, the most attention has been paid to implementation. It used to be the primary factor.”

Verification ‘explosion’

The ability to use EDA tools to shave 5 per cent off chip area or boost performance enough to hit a higher clock speed could make the difference between a project succeeding and failing. Verification, which just made sure the result would work, was comparatively straightforward. At the Design Automation Conference (DAC) last year, Steve Jorgensen reported that when he joined Hewlett Packard he was the only specialist verification engineer there. Now, the verification team outnumbers the designers. “The fastest-rising cost is engineering headcount and IP is not helping,” Jorgensen said.

The many ways in which IP blocks can interact lead to a combinatorial explosion in the tests that would be needed for exhaustive system-level verification. Small changes in the behaviour of an IP block can have knock-on effects on the rest of the system. “Any change I have to completely reverify,” Jorgensen says.
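A back-of-the-envelope calculation makes the scale clear. The block and mode counts below are invented purely for illustration, and the sketch assumes each IP block exposes a handful of operating modes that can interact freely:

```python
# Invented illustration: test counts for an SoC built from n IP blocks,
# each of which can sit in one of k operating modes.
from math import comb

def exhaustive_tests(n_blocks: int, modes: int) -> int:
    """Tests needed to cover every combination of block modes."""
    return modes ** n_blocks

def pairwise_tests(n_blocks: int, modes: int) -> int:
    """Lower bound on tests covering every mode pair of every block pair."""
    return comb(n_blocks, 2) * modes ** 2

for n in (4, 8, 16, 32):
    print(f"{n:>2} blocks of 4 modes: exhaustive={exhaustive_tests(n, 4):.2e}, "
          f"pairwise>={pairwise_tests(n, 4)}")
```

Even the pairwise lower bound grows quadratically, and the exhaustive count is hopeless beyond a handful of blocks, which is why system-level verification has to lean on targeted scenarios rather than brute force.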

Siwinski adds: “When I started EDA, the notion was that verification was easy. Verification was not really a problem. It was just ‘stupid engineers’. There is a large automation gap that needs to be serviced.”

According to Jack Greenbaum, director of embedded software engineering at Green Hills Software, some of the most pernicious bugs his team faces appear when the on-chip buses that link multiple processors together suddenly fail. The cause is not an overt logic bug. The failure is triggered by the bus becoming unexpectedly congested, which causes a series of tasks to miss their deadlines and, because they do not finish their work in time, lock up the system.
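A minimal sketch of that failure mode (all task names, runtimes and deadlines below are invented) shows how a system with no faulty component can still tip over once bus contention stretches effective runtimes past their deadlines:

```python
# Minimal sketch with invented numbers: tasks that each need some bus
# time meet their deadlines until contention stretches their effective
# runtimes past the deadline.

def effective_runtime(cpu_ms: float, bus_ms: float, contention: float) -> float:
    """Bus accesses slow down under contention; return the stretched runtime."""
    return cpu_ms + bus_ms * contention

tasks = [  # (name, cpu_ms, bus_ms, deadline_ms)
    ("sensor",  1.0, 0.5, 5.0),
    ("decode",  2.0, 2.0, 8.0),
    ("control", 1.5, 1.0, 4.0),
]

for contention in (1.0, 1.5, 2.5, 4.0):
    misses = [name for name, cpu, bus, dl in tasks
              if effective_runtime(cpu, bus, contention) > dl]
    status = f"misses: {misses}" if misses else "all deadlines met"
    print(f"contention x{contention}: {status}")
```

At low contention every deadline is met; past a threshold, tasks start overrunning even though nothing in the design has changed.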

Siwinski says cache coherency and other techniques used to boost performance add complexity to the protocols that is difficult to verify, raising the risk of encountering ‘heisenbugs’ that surface and vanish unpredictably. But verification is catching up with the demand.

Finding the best tools

Because software is going to be the stimulus that uncovers these problems in the field, code has become one of the main weapons in system-level verification. What could be better than the application that the system will run in the end? But it’s far from being a simple cure for verification woes.

The software developed for the project will generally be tuned to take advantage of the hardware. According to Klaus-Dieter Schubert, distinguished engineer at IBM, this has a significant drawback. The software is unlikely to probe the corner cases that so often make a system fall over, and it flies in the face of techniques such as constrained random verification, where the testbench generates test vectors on the fly, aimed mainly at the minimum and maximum values where logic errors often lurk.
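In hardware testbenches this randomisation is typically done with the constraint solvers built into languages such as SystemVerilog; a loose Python analogue (the descriptor fields and ranges here are invented) conveys the idea of stimulus deliberately weighted towards boundary values:

```python
import random

def constrained_random(lo: int, hi: int, boundary_bias: float = 0.5) -> int:
    """Draw a value in [lo, hi], landing on the min/max boundaries
    (where off-by-one and overflow bugs tend to hide) with extra weight."""
    if random.random() < boundary_bias:
        return random.choice([lo, lo + 1, hi - 1, hi])
    return random.randint(lo, hi)

# Invented example: stimulus for a DMA transfer descriptor.
for _ in range(5):
    descriptor = {
        "length":  constrained_random(1, 4096),       # zero-length excluded
        "address": constrained_random(0, 2**32 - 1),  # 32-bit address space
        "burst":   constrained_random(1, 16),
    }
    print(descriptor)
```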

A second problem is that the application is unlikely to be ready in time. In many projects, the software used to test the target is written and rewritten several times over during the course of development. “Customers have built tens of millions of lines of code, and then they throw it away at the end,” says Heaton.

Tools such as Cadence’s Perspec have emerged to try to fill the gap, providing a link between applications and constrained random verification. The Perspec tool employs descriptions of use-cases to assemble software on the fly from the basic C functions that the final application would use to access the hardware. For a video decoder, the solver used by Perspec may select a video format and randomise aspects of the video stream to test different scenarios. Larger scenarios, such as decoding a video while making a phone call, provide the kinds of tests that application software would exercise.
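The names below are not Perspec’s actual API; they are a hypothetical sketch of the general idea, in which one abstract scenario is expanded into many concrete tests by randomising its free parameters:

```python
import random

# Hypothetical sketch of scenario-based test generation in the spirit the
# article describes; none of these names come from Perspec itself.
VIDEO_FORMATS = ["h264", "h265", "vp9"]
RESOLUTIONS   = [(1280, 720), (1920, 1080), (3840, 2160)]

def generate_video_call_test(seed: int) -> list[str]:
    """Return the sequence of driver-level C calls one concrete test makes."""
    rng = random.Random(seed)
    fmt = rng.choice(VIDEO_FORMATS)
    w, h = rng.choice(RESOLUTIONS)
    return [
        f'video_decoder_init("{fmt}", {w}, {h})',
        'modem_start_call()',
        f'video_decoder_feed(frames={rng.randint(1, 300)})',
        'modem_end_call()',
        'video_decoder_check_output()',
    ]

# One scenario, many tests: vary only the seed.
for seed in range(3):
    print(generate_video_call_test(seed))
```

Varying only the seed yields a new concrete test each time, which is what lets a single scenario stand in for hundreds of hand-written directed tests.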

“One scenario can generate hundreds of tests for you. It’s a bit like constrained random but for software,” Heaton claims.

Although automated test tools can help accelerate software-based testing without waiting for the applications, there is still a strong pull to start code development as early as possible. Managers want to meet stringent market deadlines and be better placed to debug subtle issues such as unexpectedly high power consumption, a factor that has troubled projects such as Qualcomm’s Snapdragon SoC. But it is difficult to develop software when the hardware on which it is meant to run is not ready.

The ability to run billions of cycles on demand is the key. Jean-Marie Brunet, product marketing director at Mentor Graphics, points to several recent examples of power consumption being the main cause of problems in the final product. Devices such as smartphones have failed in the market because they proved to be heavier on the batteries than their competitors. Sometimes, the power consumption can reach such high levels that safety routines halt operation to stop the chip overheating.

Ironically, a process technology introduced because it can be more energy-efficient has made it more important to verify power before the design is complete. Brunet says: “The move to finFETs changed the game.”

Almost all chipmakers decided to switch to finFETs in place of planar transistors for the 14nm class of processes because they are less leaky, which lets battery-powered devices run for longer between charges. But finFETs are also faster at switching and draw more current when they do. This means designers need to pay attention to how often, and how much, logic switches on each clock cycle. Making sure software does not put too much strain on the system is the most effective way of controlling peak power.
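The relationship at work here is the standard dynamic-power equation, P ≈ αCV²f, where α is the fraction of the logic that switches each cycle. A quick calculation (the capacitance, voltage and frequency figures are invented for illustration) shows why switching activity is the lever that software controls:

```python
# Standard dynamic-power relationship: P_dyn ≈ alpha * C * V**2 * f,
# where alpha is the fraction of logic switching each cycle.
# The C, V and f figures below are invented, illustrative values.

def dynamic_power_watts(alpha: float, c_farads: float,
                        v_volts: float, f_hz: float) -> float:
    return alpha * c_farads * v_volts**2 * f_hz

C_EFF = 10e-9   # 10 nF effective switched capacitance (invented)
VDD   = 0.8     # supply voltage in volts
FREQ  = 2e9     # 2 GHz clock

for alpha in (0.05, 0.15, 0.30):
    p = dynamic_power_watts(alpha, C_EFF, VDD, FREQ)
    print(f"switching activity {alpha:.0%}: {p:.2f} W")
```

Voltage and frequency are largely fixed by the process and the performance target, so the switching activity the software provokes is what separates a cool-running workload from one that trips the thermal limits.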

This leads to another reason to shift software development earlier. “You want to verify the software at pretty much the same time as the hardware,” Brunet explains.

One option is to rush to implementation to get a prototype ready for software engineers to work on as soon as possible. Despite the eyewatering cost of mask sets for leading-edge processes, which now run into the tens of millions of pounds, and the fact that it can take close to three months to get the silicon back, a chip that is certain to have major errors in it may still run well enough to debug large applications. Brunet says: “It is not uncommon and was not happening five years ago.”

A possible halfway house is programmable logic. Users of field-programmable gate arrays (FPGAs) have used their target devices as prototypes for years. A prototyping board that deploys multiple FPGAs is today the closest you can get in performance to the final chip.

Giles Peckham, marketing director for Xilinx in EMEA, comments: “Verification tools have got a lot better but there is nothing like having a good platform to evaluate.”

Prototyping, even with programmable logic, takes time to put into action. The approaches used to design hardware for programmable logic are subtly different from those used for custom SoCs and it can take several weeks to convert one to the other.

Although they are more expensive in capital cost and typically run software more slowly than an FPGA prototype, in-circuit emulators can take a hardware design almost as-is. Techniques that avoid booting an operating system from scratch on each software run cut the time it takes to reach the point where useful verification work can begin, both to debug code and to collect data on how much power the target will use. As SoCs become more complex, emulation has become a tool used across a wider spread of industries.

Siwinski says: “It used to be used just for processor and GPU development. Now a whole slew of verticals are pushing into emulation.”

The variation in power consumption as the SoC moves in and out of different modes can be dramatic. The ramifications reach down to the PCB, as sudden changes in current demand can tax the power-delivery network that feeds the chips. Because they now operate at less than 1V, the current levels going into even a device designed for mobile use present a problem. The losses due to resistance even over short traces are significant, and contribute to local heating as well as a loss in battery life. Redesigning I/O layouts to provide plenty of copper for power and ground has become vital.
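The arithmetic behind that is simple Ohm’s law. The rail voltage, current and trace resistances below are invented, but they are in a realistic range for a mobile SoC:

```python
# Ohm's-law illustration of why sub-1V rails need generous copper.
# All figures are invented for the example.

def pdn_check(supply_v: float, current_a: float, resistance_ohms: float):
    drop_v = current_a * resistance_ohms      # V = I * R
    loss_w = current_a**2 * resistance_ohms   # P = I^2 * R
    droop_pct = 100 * drop_v / supply_v
    return drop_v, loss_w, droop_pct

for r_milliohms in (1, 5, 10):
    drop, loss, pct = pdn_check(supply_v=0.9, current_a=10.0,
                                resistance_ohms=r_milliohms / 1000)
    print(f"{r_milliohms} mOhm path at 10 A: {drop*1000:.0f} mV droop "
          f"({pct:.1f}% of rail), {loss:.2f} W lost as heat")
```

At a 0.9V rail, even a 5 milliohm path drops more than 5 per cent of the supply at 10 A, which is why the copper budget for power and ground dominates modern I/O planning.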

“Traditionally, tools looked at the PCB as one entity, packaging another and chip design yet another. When people were working on these issues independently it was kind of a broken process. We’ve now gone to a more holistic view,” says Humair Mandavia, executive director of Zuken’s Silicon Valley R&D centre.

By bringing the package, board and chip design together, engineers at the different levels can negotiate as to where to place I/Os to ensure that the board design can satisfy the device’s peak power needs.

The greater integration between the levels of design also provides more opportunities to tune power. Chipmakers such as Intel are keen to pull DC/DC converters onto the SoC die itself so that the voltages supplied to many different blocks can be tuned dynamically at once.

“Teams are working more in parallel. That can create more headaches. But EDA companies and tools are working together to help bridge those gaps. The key is predictability,” concludes Mandavia.
