Designers of electronic hardware and critical embedded systems have embraced techniques to avoid expensive errors. Mainstream IT could learn from the experience.
The thing that all chip designers fear is the $50m paperweight. Everything looks good before the design heads off to the fab plant. But they have forgotten to test something important and the chip, when it comes back, barely flickers when the first volts are applied. Not only have the designers blown millions of dollars on the mask used to define the features that form the circuits, they have almost instantly pushed the project six months (or more) late.
That is not a good time to be a hardware designer.
Even the software in hardware-dominated design can run similar risks. Although software has been called in to rescue problems that appear in hardware – companies such as Nvidia have used software patches to correct small hardware bugs in chips – software can cause equally big and expensive project delays.
Chris Murray, vice-president of business development for software tools supplier LDRA, points to the stringent rules imposed by regulatory agencies such as the US Food and Drug Administration, which oversee the development of life-critical devices: 'For a medical device manufacturer, it takes 250 days to get approval. If there is a software problem after approval that requires a change and it can be just one line of code, that's millions of dollars of lost revenue while the revision is approved.'
Prepare for failure
As you would expect, hardware and embedded-systems designers are a cautious breed: you do not want anything critical to leave the lab until it has been checked over time and again. The hardware and critical-systems development communities have developed ways to better prove the correct behaviour of their designs – and there are lessons there for a software industry desperate to clean up its image.
In the embedded-software space, software development is far more disciplined. It has to be. These design teams plan for failure, so that things do not get out of control when something does go wrong. Jesse Smith, R&D design manager at medical-device supplier Stryker, is candid: 'We assume that the software will fail. The question is, what is the outcome for the patient or the surgeon from that failure?'
According to Geoff Patch, software engineering manager at naval radar maker CEA Australia, the price of code simply not executing fast enough can be devastating. 'The SS-N-27 "Sizzler" missile system is a fearsome weapon. It travels at up to Mach 2.9: that's about the same speed as a bullet fired from an M16 rifle. For a warship at sea, the line of sight might be 20 miles. With an inbound Sizzler, the defending crew has just 15 seconds to detect it and engage it,' Patch says. 'Every second counts.'
CEA does not need an enormous team to go over the code carefully to remove the errors. 'We built our third-generation radar software with no more than 18 engineers at any one time,' Patch claims. The key is not to fall into the trap outlined in Fred Brooks' book 'The Mythical Man-Month' of throwing more resources and testing time into the project, but to structure it in a way that makes failure less likely. One thing that the electronics world has depended on for years is the concept of the well-defined interface.
Databooks are packed with documentation on the signals carried by each and every pin on a packaged component. Since the development of transistor-transistor logic (TTL), circuit-board designers have become accustomed to being able to mix and match parts from different suppliers at will, using the databook timing diagrams to work out how the individual parts communicate. It is a model that the creators of object-oriented software wanted to see translated into software, although it has never quite worked in the same way, largely because software is so easily modified.
Systems on a chip
In the past ten years, the same idea of reusable standard parts has applied to the functions built into integrated circuits.
Designers routinely buy in so-called intellectual property (IP) cores and plug them together to form a complete system on chip (SoC). Very often, these cores are delivered in the form of hardware description languages which are then synthesised into actual circuit layouts by the user. These languages look very much like software source code – the Verilog language is structured similarly to C; its competitor VHDL uses Ada as the language model.
Like software, the hardware descriptions can be altered and engineers often thought they could improve the code delivered by a third party. As with the temptation to modify objects and classes in the software world, there is a catch: 'It's high cost and very high risk,' says Kathryn Kranen, president of hardware-verification company Jasper Design Automation. Potentially, the changes break the databook definitions, cause circuits that communicate with the core to break down, and leave the designers with the dreaded $50m paperweight. The result was a focus on much more effective pre-silicon verification.
Because bugs can remain hidden in corner cases, constrained random verification offered one way of testing a design. Modern chip design is built around the concept of simulation: running a model of the proposed circuitry on a fast computer. Simulations run millions of times more slowly than a real chip. Even a simple circuit, such as one to compare two 32-bit values, would take thousands of years to check exhaustively. As the average digital chip design is far more complex, checking every possible value using simulation is impossible.
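The scale of the problem can be checked with some quick arithmetic. The sketch below counts the input pairs for a two-input, 32-bit comparator and converts that into simulation time. The simulation rate is an assumption chosen purely for illustration; it is not a figure from any particular simulator.

```python
# Back-of-envelope estimate: how long would exhaustive simulation of a
# two-input, 32-bit comparator take?

INPUT_BITS = 2 * 32                  # two 32-bit operands
TOTAL_VECTORS = 2 ** INPUT_BITS      # every possible pair of input values

# ASSUMPTION: the simulator evaluates 100 million input pairs per second.
# This rate is illustrative only.
VECTORS_PER_SECOND = 100_000_000

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_to_exhaust(total_vectors: int, vectors_per_second: int) -> float:
    """Return the years needed to simulate every input vector once."""
    return total_vectors / vectors_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = years_to_exhaust(TOTAL_VECTORS, VECTORS_PER_SECOND)
    print(f"{TOTAL_VECTORS:,} input pairs, about {years:,.0f} years")
```

Under that assumed rate the answer comes out at several thousand years, and every extra input bit doubles it.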
You do not always have to check all conditions, because some of them can never happen. Designers can insert assertions into their code to tell users how a block should be used and to test for violations of those conditions. For example, a designer can specify that an acknowledgement signal should follow a request after no more than 10 clock cycles. When that assertion is attached to the simulation, the tool will check for situations where that assertion does not hold.
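In hardware this kind of rule would normally be written in an assertion language and evaluated by the simulator. As a rough illustration only, the same request/acknowledge rule can be expressed as a check over a recorded trace. The function name, the trace format and the decision to track one outstanding request at a time are all assumptions made for this sketch.

```python
from typing import List, Tuple

MAX_ACK_DELAY = 10  # clock cycles, the bound used in the example above


def check_req_ack(trace: List[Tuple[int, int]],
                  max_delay: int = MAX_ACK_DELAY) -> List[int]:
    """Return the cycles of requests that were not acknowledged in time.

    `trace` holds one (req, ack) sample per clock cycle. Only one request
    is tracked at a time; a request still waiting when the trace ends,
    with its deadline not yet reached, is not reported.
    """
    violations = []
    pending = None  # cycle at which the outstanding request was raised
    for cycle, (req, ack) in enumerate(trace):
        if pending is not None:
            if ack:
                pending = None                # acknowledged in time
            elif cycle - pending >= max_delay:
                violations.append(pending)    # deadline passed, no ack
                pending = None
        if req and pending is None:
            pending = cycle                   # start timing a new request
    return violations


if __name__ == "__main__":
    on_time = [(1, 0)] + [(0, 0)] * 9 + [(0, 1)]    # ack 10 cycles later
    too_late = [(1, 0)] + [(0, 0)] * 10 + [(0, 1)]  # ack 11 cycles later
    print(check_req_ack(on_time), check_req_ack(too_late))
```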
Instead of testing all possible conditions, the testbench can randomly generate a subset. The unusual values that pop out can often find otherwise hard-to-find bugs. The idea soon caught on. 'Most verification environments are now based on constrained random techniques,' reports Kranen.
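A small sketch shows why the constraints matter. It uses a hypothetical 32-bit equality comparator with a deliberately planted bug (it ignores the top bit) and compares two stimulus generators: one purely random, one constrained so that the second operand is either identical to the first or differs in a single bit. The comparator, the bug and the generators are all invented for illustration.

```python
import random

WIDTH = 32
MASK = (1 << WIDTH) - 1


def reference_equal(a: int, b: int) -> bool:
    """Golden model: true only when all 32 bits match."""
    return a == b


def dut_equal(a: int, b: int) -> bool:
    """Hypothetical design under test with a planted bug: top bit ignored."""
    low_mask = MASK >> 1
    return (a & low_mask) == (b & low_mask)


def unconstrained_pair(rng: random.Random):
    """Two independent random 32-bit values."""
    return rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)


def constrained_pair(rng: random.Random):
    """Constraint: b equals a, or differs from a in exactly one bit."""
    a = rng.getrandbits(WIDTH)
    if rng.random() < 0.5:
        return a, a
    return a, a ^ (1 << rng.randrange(WIDTH))


def find_mismatch(generator, trials: int, seed: int = 0):
    """Return the first (a, b) where the design disagrees with the model."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = generator(rng)
        if dut_equal(a, b) != reference_equal(a, b):
            return a, b
    return None


if __name__ == "__main__":
    print("unconstrained:", find_mismatch(unconstrained_pair, 10_000))
    print("constrained:  ", find_mismatch(constrained_pair, 10_000))
```

Purely random pairs almost never land on two values that differ only in the top bit, so the bug stays hidden; steering the generator towards near-equal operands exposes it within a handful of trials.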
The same concept is available to software designers although it is used more rarely. The software languages Erlang and Haskell, among others, support a tool called QuickCheck. 'It allows you to build a model of the program you are developing. It will go in and start generating random sequences based on this model,' says Francesco Cesarini, co-founder and technical director of consultancy Erlang Solutions. 'These random sequences will find the most absurd bugs in your code that will never be found in static or manual testing.'
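The idea Cesarini describes can be sketched without any special tooling. The example below is a hand-rolled illustration in Python, not QuickCheck's API: it drives a hypothetical ring buffer with random sequences of push and pop commands and compares every result with a plain list acting as the model. Real tools of this kind also shrink a failing sequence to a minimal case, which this sketch does not attempt.

```python
import random


class RingBuffer:
    """Fixed-capacity FIFO under test (hypothetical example code)."""

    def __init__(self, capacity: int):
        self.data = [None] * capacity
        self.capacity = capacity
        self.head = 0   # index of the oldest element
        self.size = 0

    def push(self, value) -> bool:
        if self.size == self.capacity:
            return False
        self.data[(self.head + self.size) % self.capacity] = value
        self.size += 1
        return True

    def pop(self):
        if self.size == 0:
            return None
        value = self.data[self.head]
        self.head = (self.head + 1) % self.capacity
        self.size -= 1
        return value


class BuggyRingBuffer(RingBuffer):
    """Same buffer with a planted bug: push ignores the head position."""

    def push(self, value) -> bool:
        if self.size == self.capacity:
            return False
        self.data[self.size] = value   # BUG: should be offset from head
        self.size += 1
        return True


def run_random_sequences(make_buffer, runs=200, length=30, capacity=4, seed=1):
    """Return the first command history that diverges from the model."""
    rng = random.Random(seed)
    for _ in range(runs):
        buf, model, history = make_buffer(capacity), [], []
        for _ in range(length):
            if rng.random() < 0.6:
                value = rng.randrange(100)
                history.append(("push", value))
                expected = len(model) < capacity
                if expected:
                    model.append(value)
                actual = buf.push(value)
            else:
                history.append(("pop",))
                expected = model.pop(0) if model else None
                actual = buf.pop()
            if actual != expected:
                return history
    return None


if __name__ == "__main__":
    print("correct buffer:", run_random_sequences(RingBuffer))
    print("buggy buffer:  ", run_random_sequences(BuggyRingBuffer))
```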
Sometimes exhaustive checking is necessary, as processor giant Intel discovered more than ten years ago when an obscure bug was found in the floating-point unit of its Pentium line of chips. As even extensive simulation was found wanting, hardware designers began to look at formal verification. The idea is far from unknown to the software world, thanks to work on languages such as Z. However, formal verification is now used on practically every digital chip in some form.
The most common formal verification technique is equivalence checking. This analyses a synthesised circuit to ensure it matches the original hardware description in case there are bugs – which there are – in the automated synthesis tools. Model-generated code is increasingly common in embedded systems: tools such as MathWorks' MATLAB, a high-level language and interactive environment for computationally intensive tasks, are used routinely in control and radio systems. As a result, this kind of equivalence checking is beginning to move into the software domain to ensure that the generated code matches what the model creator built.
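Production equivalence checkers prove a match mathematically rather than by running inputs. For a sufficiently small input space, though, the goal can be illustrated by brute force: compare a reference model with a differently structured implementation on every possible input. The sketch below does this for a hypothetical 8-bit saturating adder; both functions are invented stand-ins and neither comes from a real code generator.

```python
from itertools import product


def model_sat_add(a: int, b: int) -> int:
    """Reference model: 8-bit unsigned saturating addition."""
    return min(a + b, 255)


def generated_sat_add(a: int, b: int) -> int:
    """Stand-in for tool-generated code: branch-free bit manipulation."""
    total = a + b
    overflow = total >> 8               # 1 if the sum needs a ninth bit
    return (total | -overflow) & 0xFF   # all ones on overflow, else the sum


def first_difference(f, g, bits: int = 8):
    """Return the first (a, b) on which f and g disagree, or None."""
    for a, b in product(range(1 << bits), repeat=2):
        if f(a, b) != g(a, b):
            return a, b
    return None


if __name__ == "__main__":
    print(first_difference(model_sat_add, generated_sat_add))
    # A wrapping adder is NOT equivalent to the saturating model.
    print(first_difference(model_sat_add, lambda a, b: (a + b) & 0xFF))
```

Enumeration only works here because two 8-bit inputs give 65,536 cases; it is the explosion at realistic widths that pushes real tools towards formal proof.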
A less widely used technique is model checking, in which the assertions generated by designers are fed to a formal verification tool. This is the only practical way of checking something like a floating-point multiplier, and is the reason why Intel is a big user of formal verification. Microprocessor specialist ARM, meanwhile, has adopted formal tools from Jasper to make it easier to check the IP that it delivers to its customers, which include Apple and Qualcomm. Jasper Design Automation's Kathryn Kranen says: 'The designers at ARM use formal techniques to answer questions about the design; for example, is there dead code?'
Although it has taken longer, a similar trend is being seen in critical software. 'Formal methods have definitely crossed over. They are being used more and more,' says Murray. A big question in software is not so much whether the code does what it is meant to, but whether anyone knows what it was meant to do in the first place.
So-called 'feature creep' is a problem for many projects and can be a major source of errors. One way that managers of large avionics and military programmes have ensured that a design meets its requirements and is fully tested is to trace those requirements from the original specification through to implementation. Specialist requirements-traceability tools provide the information to link the two together. 'Requirements traceability is extremely important to ensure that you have built what you need,' says Murray.
'It's also to ensure that you have met the regulatory requirements: a lot of our customers have to meet DO-178.'
Murray contends that requirements traceability from this part of the electronics world has much application in mainstream enterprise IT deployment, as it provides a way of ensuring that tests are aimed at the right areas. 'You can ensure that you can trace a requirement to its associated test; however, the have-to-use and want-to-use is sometimes the difference between management and the developers,' Murray says. 'From the developers' point of view, it is useful because the tool can show that they are, for example, 95 per cent complete. That is something that developers are actually enthusiastic about. The development team can say "we are running late because you have added five extra requirements".'
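At its simplest, a traceability record is a mapping from each requirement to the tests that exercise it. The sketch below is not how any commercial tool works; the requirement IDs, test names and the definition of 'complete' (at least one linked test, and every linked test passing) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def untested(requirements: List[str],
             tests_for: Dict[str, List[str]]) -> List[str]:
    """Requirements that have no test linked to them."""
    return [r for r in requirements if not tests_for.get(r)]


def percent_complete(requirements: List[str],
                     tests_for: Dict[str, List[str]],
                     passed: Set[str]) -> float:
    """Share of requirements whose linked tests exist and all pass."""
    done = sum(
        1 for r in requirements
        if tests_for.get(r) and all(t in passed for t in tests_for[r])
    )
    return 100.0 * done / len(requirements)


if __name__ == "__main__":
    # Hypothetical project data.
    requirements = ["REQ-1", "REQ-2", "REQ-3", "REQ-4"]
    tests_for = {
        "REQ-1": ["test_power_on"],
        "REQ-2": ["test_alarm", "test_alarm_timeout"],
        "REQ-3": ["test_dose_limit"],
        # REQ-4 was added late and has no test yet.
    }
    passed = {"test_power_on", "test_alarm", "test_dose_limit"}

    print("untested:", untested(requirements, tests_for))
    print("complete:", percent_complete(requirements, tests_for, passed), "%")
```

With a record like this, a late requirement shows up immediately as an untested entry and as a drop in the completion figure.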
Perhaps the most important difference between hardware-oriented design and the world of mainstream software design is the issue of concurrency. In hardware, everything runs in parallel unless you have put pipelines and registers in to make certain operations run in sequence.
Despite having structural similarities to C, Verilog is very much a language that describes concurrent operations. One of its most commonly used constructs is the 'always' block – it is wrapped around elements that run concurrently and is activated when one of its conditions becomes active. And that can happen at any time.
Software writers have struggled with concurrency, largely because popular languages assume sequential operation. When telecommunications company Ericsson was designing its switches in the late 1980s, it decided that it needed a better language to describe how software would run inside them. Telecom hardware uses a lot of processors and concurrent operations so the language a research team developed there, called Erlang, had support built in.
These days Erlang is still owned by Ericsson Computer Science Laboratory, which describes it as 'a programming language used to build massively scalable soft real-time systems with requirements on high availability. Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging'.
Cesarini first came across Erlang while working as an intern at Ericsson in Uppsala, Sweden: 'When I came into contact with Erlang, I realised it was ahead of its time: Erlang is built for multicore and the Web, although it predates multicore and Web.'
The fact that things can happen at any time in hardware drove a move to simplify design: asynchronous events can lead to all sorts of problems such as race conditions where it is impossible to predict which state the system will wind up in if two events happen close in time to each other.
Synchronous design is the mainstay of modern electronic circuit design, reducing the points at which systems interact to defined points in time. Contrast that with the threads employed by procedural languages such as C, which are essentially asynchronous. To prevent 'race conditions', programmers have to employ locking strategies that greatly complicate code.
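The locking burden is easy to see even in a toy case. The sketch below uses Python threads rather than C, and the class and function names are invented for the example. Several threads increment a shared counter; the read-modify-write in `increment` is only safe because every access is wrapped in a lock, and forgetting that wrapper anywhere in a larger program reintroduces the race.

```python
import threading


class Counter:
    """Shared counter whose updates are serialised by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def run_workers(n_threads: int = 4, n_increments: int = 10_000) -> int:
    """Start the threads, wait for them all, and return the final count."""
    counter = Counter()

    def work() -> None:
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())
```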
Languages such as Erlang work around the problem of asynchronous behaviour using a different strategy. Cesarini says Erlang avoids a lot of the lock overhead by not allowing threads to share memory: they have to send data to each other. On top of that, threads are designed to have low overhead so that, in the case of an SMS message switch, a dedicated thread handles each incoming message. This avoids the need for a thread to manage its own message queues.
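The share-nothing style can be approximated in other languages, although without Erlang's lightweight processes. The sketch below is a loose Python analogy, not Erlang semantics: each incoming message gets its own worker thread, no worker touches another's data, and results travel back through a queue. The handler and its 'processing' (measuring the message length) are placeholders invented for the example.

```python
import queue
import threading
from typing import List, Tuple


def handle_sms(message: str, outbox: queue.Queue) -> None:
    """Process one message and send the result; no shared state is used."""
    outbox.put((message, len(message)))   # placeholder for real work


def switch(messages: List[str]) -> List[Tuple[str, int]]:
    """Spawn one worker per message and collect what they send back."""
    outbox: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=handle_sms, args=(m, outbox))
        for m in messages
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results)   # arrival order is not deterministic


if __name__ == "__main__":
    print(switch(["hi", "hello", "call me"]))
```

Operating-system threads are far heavier than Erlang processes, so one thread per message would not scale in the same way; the point of the sketch is only the absence of locks in the handler.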
As hardware designers moved from laying circuits to hardware descriptions, they adopted other simplifying techniques and naming strategies that can be thought of as coding standards. Similarly, in the safety-critical embedded sector, coding standards have become mainstays of development.
'You have to look at history,' Murray maintains. 'Why did these coding standards arise? People realised over the years the same types of errors came from certain coding constructs. The standards ensure that you don't use the riskiest. Anyone who picks a coding standard will probably save a lot of time and money in programming.'