Safety requirements are forcing engineers to think the unthinkable about their designs.
Drones, self-driving cars, robots breaking out of their industrial safety cages. The computer is now in control of many motorised systems that, if they go awry, could easily be deadly. And purely by accident. The arrival of cyber-physical systems in mainstream life presents a challenge to engineering in areas that have not had to look at safety in such detail before. And it’s overturning traditional practice.
The problem with safety-oriented design is that it discards some of the core assumptions of mainstream electronic system design. Functional verification assumes that circuits work properly. Faults in the circuit are caught by test and any suspect chip junked. If something fails in the field later, that is not a problem for verification because it is not a systematic problem and so does not affect all the other devices of its type.
When safety comes into the situation all bets are off. A chip failing in the field is then a big deal: it may cause antilock brakes to stop working or make the steering unresponsive. The question is no longer what should happen but what could possibly go wrong. “Gates all of a sudden don’t behave the way they should. It’s a totally different problem to solve. Who in their right mind would verify a state machine where states jump for no reason from one state to another?” Avidan Efody, verification architect at Mentor Graphics, explained at the Verification Futures conference organised by T&VS earlier this year.
But dealing with the out-of-control state machine is what safety standards such as ISO 26262 require. The trick to making safety analysis manageable is to reduce those sources of problems down to those that can result in dangerous errors.
Engineers in military and aerospace have contended with that issue for decades and developed procedures to analyse the sources of problems. Radiation is one of the biggest concerns. High-altitude aircraft, satellites and spacecraft have to be designed to contend with the high probability that a gamma ray or alpha particle will upset the operation of a transistor.
Ionising radiation hitting the silicon substrate will create a cascade of free electrons that can tip the logic state of a transistor inside a circuit from its correct reading to its opposite. The good news is that the effect may be transitory. Only if the circuit is clocked at that point may the wrong state be captured. Even if it’s captured in a register, downstream logic may override its effects, so again there will not be a problem for safety. But if the injected error escapes and starts to affect operation, that is a problem that verification or what Efody calls ‘dysfunctional verification’ needs to catch.
Memories and registers are particularly prone to storing the wrong state as they hold their contents largely by recycling charge. Designers of computers have dealt with this problem for many years using error checking and correction (ECC) codes. Because the probability of a single-event upset increases with memory density and with the reductions in size of the transistors that implement them, computer designers have gradually increased the level of ECC used on memory buses.
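The principle behind ECC can be shown with the textbook Hamming(7,4) code, which protects four data bits with three parity bits and can correct any single flipped bit. This is only a minimal sketch: memory buses typically use wider codes, and the function names here are illustrative rather than taken from any product.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, syndrome); a non-zero syndrome is the
    1-based position of the single bit that was corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the upset bit back
    return [c[2], c[4], c[5], c[6]], syndrome


# Simulate a single-event upset in the stored word.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # bit at position 5 is flipped
data, position = hamming74_decode(word)
assert data == [1, 0, 1, 1] and position == 5
```

A code like this handles one upset per word. Two flips in the same word would be miscorrected, which is one reason stronger codes are adopted as upsets become more likely.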
Moore’s Law gives rise to another source of potential unreliability. As transistors shrink, they tend to wear out more quickly. A study by ARM working with ABB, the German Aerospace Centre and several universities found a number of sources of failure – most of which are exacerbated by reductions in transistor size and the thickness of layers that separate the key elements inside them.
Designers of safety-critical hardware have for decades tended to be more conservative about moving to newer semiconductor processes partly for this reason.
“The solutions that we produce for these segments are deeply embedded. They are not PC or gaming products being reused. They are built from the ground up with an automotive use-case in mind,” says Allan MacAuslin of NXP, contrasting the company’s position with that of nVidia, which aims to sell its graphics processors into car auto-navigation systems on the back of successful results in technologies based on neural networks.
“The challenge for automated driving is to sense the environment and use that information to decide on motion planning. We see deep learning as one way to skin the cat but you can’t solve the entire problem with deep learning,” MacAuslin continues.
“The application calls for high-performance processing. But that often comes at the cost of long-term reliability and functional safety. Being able to clock a processor quickly often comes at the cost of reliability: a robust transistor switches more slowly. Our aim is to provide an embedded solution that provides machine learning, but without compromising on long-term reliability.”
MacAuslin says the company intends to focus its efforts on 28nm processes rather than the later finFET-based processes that are now used in PC and mobile phone chips. STMicroelectronics has indicated that it will be more aggressive in pursuing leading-edge processes for future safety-critical processors, for the automotive market in particular, on the basis that it needs the higher performance and density of the 10nm finFET process to support complex software such as neural networks.
One way to keep advanced processors and the software they run in check is to monitor them closely. The designers of systems used in satellites and other areas where failure cannot be tolerated often opt for triple modular redundancy. As the term suggests, each circuit is implemented three times and voting logic is then used to determine the correct output. If one logic path fails, the other two circuits will outvote it. There is a clear trade-off in terms of density. There is little point in pursuing advanced processes if the fall in reliability calls for more of the circuitry to be triplicated. That extra logic, in effect, costs at least two process generations.
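The voting at the heart of triple modular redundancy is simple enough to sketch in a few lines. The example below is an illustration only: the `adder` module and the `fault_mask` used to corrupt one replica are hypothetical stand-ins, and in hardware the voter would be a handful of gates rather than software.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, fault_mask=0):
    """Run three copies of `module` and vote on the result.

    `fault_mask` is XORed into one copy's output to mimic an upset
    in a single replica.
    """
    first = module(x)
    second = module(x) ^ fault_mask  # the replica that misbehaves
    third = module(x)
    return majority_vote(first, second, third)


def adder(x):
    """Hypothetical 8-bit module under protection."""
    return (x + 1) & 0xFF


assert tmr(adder, 41) == 42
assert tmr(adder, 41, fault_mask=0b1000) == 42  # single fault is outvoted
```

The voter only protects against a fault in one replica at a time. If two copies fail in the same way, the wrong answer wins the vote.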
Reliability or redundancy
The trade-off is seen in satellite design, where manufacturers will readily switch technologies. They make greater use of magnetic memories and antifuse-based FPGAs, which are individually more expensive than parts based on conventional memory, because these work out less costly than having to triplicate those parts of the system.
An approach used by suppliers such as Infineon Technologies and NXP is to pair high-performance processors with slower cores that run regular checks on them. Hardware and software checkers help deal with the problem of circuits breaking some time into their active life by checking their results.
NXP’s BlueBox platform, for example, couples a multicore processor, currently based on the ARM Cortex-A57, with a second chip, called the S32V, that acts primarily as its monitor. The S32V also provides some acceleration for self-learning systems.
To minimise the chance of systematic design errors creeping into the equation, the S32V contains two Cortex-A53 processors that are implemented differently. “The cores are compiled at different times and the logic gates for each are laid down in distinct islands on the SoC. We do that for all the redundant circuitry on the chip, together with diagnostic hardware on voltages, clocks and resets. We don’t rely on software to do reciprocal checking of all the tasks, which means the full performance range of the processors is available,” MacAuslin claims.
To verify what will happen if a hardware fault develops or a latent software bug emerges because of changes in conditions, design teams inject errors into a model of the system. With the proper safety checks in place, each error that propagates to an output controlling something mechanical, rather than vanishing, needs to be caught.
The first step is to draw up a list of requirements that constrain the system to do nothing considered unsafe. For example, pushing the stop button should halt a lifting machine but not in such a way that it drops a rebar on the ground. An anti-collision system in a car should slow and stop the vehicle before it slams into the back of another, even if the driver seems to have their foot pressed hard on the accelerator. Each component, whether software or hardware, needs to be checked to make sure that its behaviour is consistent with those requirements.
“You have to have full requirements traceability through all the software stack and all of the electronic systems,” says Tom Beckley, senior vice president for R&D at Cadence Design Systems. “You must verify what should happen and what should not happen.”
Fault injection is one of the keys to determining whether the unthinkable can happen. Can the lifting machine be fooled into thinking it is holding nothing aloft when the stop button is hit?
The problem, as Efody outlines, is the sheer enormity of the task. When anything can go wrong, how does verification complete in a reasonable time?
The technique that currently provides the greatest assurance to designers is fault simulation, as it readily demonstrates whether a single failed gate will propagate through to an output or whether it vanishes or is handled by safety-checking logic.
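A toy version of the idea fits in a short script. The sketch below is an assumption-laden illustration, not how any commercial tool works: the netlist, its net names and the duplicate-and-compare safety logic are all hypothetical. It forces one net at a time to a stuck value, replays every input pattern, and reports whether the fault is masked, caught by the checker, or reaches the output unnoticed.

```python
from itertools import product

OPS = {
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
    "xor": lambda x, y: x ^ y,
}

# Hypothetical design: out = (a AND b) OR c, plus a duplicate copy
# and a comparator that raises `alarm` when the two copies disagree.
NETLIST = [
    ("n1", "and", ("a", "b")),
    ("out", "or", ("n1", "c")),
    ("n1_dup", "and", ("a", "b")),
    ("out_dup", "or", ("n1_dup", "c")),
    ("alarm", "xor", ("out", "out_dup")),
]


def simulate(inputs, fault=None):
    """Evaluate the netlist; `fault` is (net, stuck_value) or None."""
    values = dict(inputs)

    def force(net):
        if fault and fault[0] == net:
            values[net] = fault[1]

    for net in inputs:
        force(net)
    for net, op, (x, y) in NETLIST:
        values[net] = OPS[op](values[x], values[y])
        force(net)
    return values


def classify(fault):
    """Compare faulty runs with fault-free runs over all input patterns."""
    corrupted = undetected = False
    for bits in product((0, 1), repeat=3):
        inputs = dict(zip("abc", bits))
        golden = simulate(inputs)
        faulty = simulate(inputs, fault)
        if faulty["out"] != golden["out"]:
            corrupted = True
            if not faulty["alarm"]:
                undetected = True
    if not corrupted:
        return "masked"
    return "dangerous" if undetected else "detected"


assert classify(("n1", 0)) == "detected"   # copies disagree, alarm fires
assert classify(("alarm", 0)) == "masked"  # never changes `out`
assert classify(("a", 0)) == "dangerous"   # shared input fools both copies
```

The last case shows why exhaustive injection matters: a fault on a signal shared by both copies defeats the comparator. The script also hints at the cost, since every fault multiplies the number of simulation runs.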
Fault simulation has the further advantage of having been heavily researched until around a couple of decades ago, as the technique underpinned the theories of semiconductor test.
Test diverged from the needs of safety-conscious design when scan-based technologies appeared at the beginning of the 1990s. The scan chains inserted into most chips – with the exception of those made for SIM and credit cards – made it possible to see where logic gates had failed so that those devices could be weeded out. Safety-conscious design does not need that level of visibility, and scan logic simply adds more gates to check for possible errors. So, rather than using today’s fault-simulation technologies, companies have reworked the test tools that pre-date scan.
In 2014, Cadence launched a tool that is functionally similar to its old Verifault-XL product but which employs approaches such as compiled code to improve its performance. Mentor’s fault simulation tool allows the simultaneous injection of multiple faults that do not interact with each other.
Although simulation tools are taking advantage of code speed-ups, another likely move is towards greater adoption of formal-verification techniques as they can cover much more of the state space in the same time. The problem, says Mark Olen, product marketing group manager at Mentor, is that customers are wary of the results from formal tools, fearing they may fail to write tests that catch problems. They see fault simulation as better understood.
Dave Kelf, vice president of marketing for OneSpin Solutions, says the results from the company’s formal tool provide a way of pruning the list of faults, weeding out those that are less important to behaviour, to come up with a list of those that need to be confirmed by simulation later. “Even in interacting logic we can put in multiple faults as long as you have the memory,” he asserts.
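One simple, purely structural way to shrink a fault list is a cone-of-influence check: any net that has no path to a safety-relevant output cannot affect it, so its faults need not be simulated. The sketch below illustrates that general idea only and is not a description of OneSpin’s tool, which works on the logic’s behaviour rather than just its wiring. The netlist and signal names such as `brake_cmd` are hypothetical.

```python
def cone_of_influence(netlist, outputs):
    """Return every net with a structural path to one of `outputs`.

    `netlist` maps each driven net to the nets feeding it.
    """
    cone, stack = set(), list(outputs)
    while stack:
        net = stack.pop()
        if net in cone:
            continue
        cone.add(net)
        stack.extend(netlist.get(net, ()))
    return cone


def prune_faults(netlist, outputs):
    """Split stuck-at faults into those worth simulating and the rest."""
    all_nets = set(netlist)
    for fan_in in netlist.values():
        all_nets.update(fan_in)
    cone = cone_of_influence(netlist, outputs)
    keep = sorted((net, v) for net in all_nets & cone for v in (0, 1))
    drop = sorted((net, v) for net in all_nets - cone for v in (0, 1))
    return keep, drop


# Hypothetical design: only `brake_cmd` is safety-relevant.
NETLIST = {
    "n1": ("a", "b"),
    "brake_cmd": ("n1", "c"),
    "dbg": ("d", "n1"),
    "led": ("dbg",),
}

keep, drop = prune_faults(NETLIST, ["brake_cmd"])
assert len(keep) == 10 and len(drop) == 6
assert ("led", 0) in drop and ("n1", 1) in keep
```

Here the debug and indicator logic drops out of the list, leaving only the faults that could possibly reach the brake command for confirmation by simulation.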
As technologies such as formal verification become more accepted as ways of identifying possible causes of concern, attention is likely to shift to more advanced ways to improve throughput and try to identify much earlier the types of logic and circuitry that need the most attention, and to reduce the amount of time spent on those that do not. But, now that electronic safety is moving into so many more systems, the subject is going to get plenty of attention for some time.