Chipmakers are encountering atomic-level design problems. Is the future full of mistakes?
'Garbage-in, garbage-out'. It's the one constant of the computing business. The computer never gets things wrong: it delivers the right answers - always. But give it the wrong question or data, and the result is rubbish. When you get a gas bill for a complete stranger, the computer gets the blame - but we all know someone omitted a semi-colon somewhere or forgot to add one. Garbage-in, garbage-out.
'Solid-state'. 'No moving parts'. They are the phrases manufacturers use to let us know that stuff is reliable. There is nothing to shake, rattle or roll. Nothing to fall off because the active components are part of the structure. The biggest problem is a dry joint - fixable with solder, assuming you can get to the offending pin. But the chip inside? It's solid.
Researchers such as Professor Asen Asenov of the University of Glasgow know, however, that solids aren't all that solid. Effects that electronics engineers used to be able to ignore are now being revealed, with the result that companies may have to ship chips where the circuits inside may not always function correctly. And, knowing that, come up with ways to correct the problem before the software, or the user, notices.
The core problem is variability. Experts speaking at the recent International Conference on CMOS Variability in London, organised by the UK's National Microelectronics Initiative (NMI), see the 22nm process as a critical juncture in the evolution of the silicon chip. Due to start production in 2012, the process will use transistors so small that Asenov claims: "You can count the dopant atoms. You can even count the number of silicon atoms in there.
"I can make one certain prediction about scaling. You will never make a transistor that is smaller than one atom. But even a bunch of atoms can cause a problem unless you have a technology that can place these atoms accurately."
The random placement of dopant atoms in the channel of a transistor can lead to big changes in performance, even between two devices sitting within tens of nanometres of each other on a die, to the extent that one works and the other fails. Size is not the only cause of variability.
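A rough sketch shows why counting atoms matters. If dopant atoms land independently and at random, the number that ends up in any one channel follows Poisson statistics, so the relative spread between devices is about one over the square root of the average count. The channel dimensions and doping concentration below are illustrative assumptions, not figures quoted at the conference, and the function names are invented for this example.

```python
import math
import random

def mean_dopants(length_nm, width_nm, depth_nm, doping_per_cm3):
    """Average number of dopant atoms in a box-shaped channel region."""
    volume_cm3 = length_nm * width_nm * depth_nm * 1e-21  # 1 nm^3 = 1e-21 cm^3
    return doping_per_cm3 * volume_cm3

def relative_spread(mean):
    """Poisson standard deviation divided by the mean: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean)

def sample_dopants(mean, rng):
    """Draw one Poisson-distributed dopant count (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

if __name__ == "__main__":
    # Assumed values: 25nm x 50nm x 10nm channel region, 5e18 dopants per cm^3.
    mean = mean_dopants(25, 50, 10, 5e18)
    rng = random.Random(1)
    neighbours = [sample_dopants(mean, rng) for _ in range(5)]
    print(f"average dopants per channel: {mean:.1f}")
    print(f"expected relative spread:   {relative_spread(mean):.1%}")
    print(f"five neighbouring devices:  {neighbours}")
```

With these assumed numbers the average is about 62 atoms per channel and the expected device-to-device spread is roughly 13 per cent. Shrinking the channel lowers the average count, and the relative spread grows.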
Jeff Watt, technology architect at Altera, says variability suddenly increased when the company moved to a 40nm process that used strained silicon. By deliberately stretching or compressing the silicon lattice through the use of silicon germanium layers or other techniques, it is possible to improve the mobility of electrons in the transistor channel and boost its current-carrying capability. But these strained layers are, in effect, packed full of dopants. The profile can change radically from one transistor to another, very often caused by interference from structures around the transistor when the dopant atoms are implanted.
For the last 30 years, design rules have made it possible for design teams to generate incredibly complex designs without much knowledge of the underlying physics.
Rather than deal with the effects analytically, the response from the industry has been to take account of the potential for variability through ever more rigid design rules. By constraining the shape of structures around each transistor, you can remove or at least ameliorate some of the shadowing and interference effects. But the rules are becoming more unmanageable with each new process.
Jean-Marie Brunet, director of product marketing for Mentor Graphics' design-for-manufacturing group, says the sphere of influence around a transistor is expanding. Most designs are based on library cells - pre-designed combinations of transistors. Until recently, the library designers were the only people affected by shape-dependent effects.
"At 32nm, the environment outside the cell can affect variability within the cell," Brunet claims. A study from a couple of years ago shows what can happen. The gate of a transistor near the edge of a cell of 65nm transistors could vary by as much as 5nm. As the physical gate length of the transistor is substantially less than 65nm, this represents a variation of more than 10 per cent.
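The arithmetic behind that percentage is straightforward. The sketch below compares a 5nm shift against several gate lengths. The 40nm and 30nm values are the gate lengths the article cites as typical on a 65nm process, not numbers taken from the study itself.

```python
def relative_variation(delta_nm, gate_length_nm):
    """Gate-length variation as a fraction of the nominal gate length."""
    return delta_nm / gate_length_nm

if __name__ == "__main__":
    for gate_nm in (65, 40, 30):
        share = relative_variation(5, gate_nm)
        print(f"{gate_nm}nm gate, 5nm shift: {share:.1%}")
```

Against a 40nm gate, a 5nm shift is 12.5 per cent of the nominal length, which is why the effect matters more than the 65nm process name suggests.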
"In the centre of the cell, the transistors have the smallest value possible. The variation increases towards the edges," Brunet explains. The answer was to surround the cell with dummy pieces of polysilicon, the material used to form the interconnect between transistor gates.
However, this is only a partial solution. As the design rules get tighter and more space is devoted to dummy features that have the effect of spreading the design out, the benefit of scaling soon disappears. Jen-Hsun Huang, CEO of graphics chipmaker nVidia, says the utilisation rate of silicon is now only about 60 per cent: some 40 per cent of the die area on a high-performance chip is effectively wasted. Process engineers are delivering ever tighter geometries only to see those gains lost in the design.
Process engineers are trying to buy back some time by not scaling the transistor gate itself as quickly as in the past. Technological changes - and the race among processor makers to win the gigahertz war, before the truce was called when power consumption problems made the war unwinnable - meant gate length scaled much more quickly than the half-pitch measurement that gives each process its name. The half-pitch determines how closely packed transistors are. On a 65nm process, it is not unusual to find transistors that have gates or channels 40nm long; sometimes they are as short as 30nm.
Referring to the International Technology Roadmap for Semiconductors (ITRS) published by research consortium Sematech, Asenov says: "The expectation for scaling of the channel length in the 2008 update has fallen back very close to the half-pitch. We used to expect 5nm or 6nm channel lengths. Not any more. I don't expect a fall in channel length below 10nm."
Watt is optimistic about the next process to go live. He does not expect variability to increase as much in the move to 28nm: "The 28nm process gets a benefit from high-k, metal gate technology, although it's not enough to compensate for the reduced channel length. For variability, we believe solutions exist down to 28nm, and at 28nm there are some encouraging results. Beyond that is a little less clear. But the industry has faced challenges in the past that looked insurmountable and we solved them."
The problems caused by using lots of dopants are one reason why some technologists believe the industry as a whole will need to move away from the conventional 'bulk CMOS' transistor to more exotic, and difficult-to-make, structures such as FinFETs or ultra-thin silicon-on-insulator transistors.
"We expect significant problems around the 22nm channel," Asenov says. "It will be the last technology node to use bulk MOSFETs in mission-critical applications."
Kelin Kuhn, an Intel fellow and director of advanced device technology at the world's largest chipmaker, is optimistic that problems with variability can be overcome, but notes: "I do see a future in ultra-thin body or FinFET-type devices. We are going to have a need to move to these devices. Probably not at 22nm but maybe at 15nm."
However, 15nm has a problem in that it will probably need a wholesale change to the way shapes are defined on the surface of the silicon, demanding a technology that is still not ready and has been delayed by 15 years. Some, such as Huang and Sani Nassif, manager of tools and technology at IBM's Austin research lab, think the process that follows the 32nm and 28nm generations - expected to go live later this year - could mark a hiatus in the development of silicon processes for this reason. "The industry could stick at 22nm for some time," says Huang.
That does not mean process development will stop, just that what engineers do to improve density will change. Instead of simply making the transistors ever smaller, lost ground in design and architecture could be regained. "We could call it '22F' - 'F' for fast," quips Nassif.
Further scaling will increase the risk of failures as the devices are used. The structures are so small that running high current densities through them for any length of time will cause them to degrade and exhibit even more variability. Asenov presents this as an expanding cloud of variability. "At t=0, the devices are well tuned for variability. But some will go out of spec over time. The question is how you deal with this at the design level. Will you take this into account during the design phase? By doing this, you will probably lose a lot of performance. Or you can introduce techniques to measure, detect, understand and compensate for this degradation."
One way out is to move to 3D structures. You can optimise different parts of the chip separately by having them on different layers. More importantly, says Nassif, you can have individual power supplies for each part. This is important for memories, which exhibit the worst problems with variability in today's designs. By adjusting the voltage supplied to each piece of the memory array, you can claw back margin that cross-chip variability steals.
Another option is to work with variability and assume that parts of the chip will fail and even design them to, on occasion, make a mistake. It's an approach that Krisztián Flautner, vice president of R&D at ARM, says chipmakers could use to improve performance. He reckons the idea of using margins to hide variability is running out of time. If they are made increasingly conservative, chips will not satisfy performance targets. "Being wrong by 0.1V means that you may be sacrificing 30 per cent performance," he claims.
One approach that Flautner has worked on is called Razor. It targets the biggest headache that designers have in implementing on-chip circuits. Very often, even big variances are manageable. But designers have to deal not just with local variations but the changes from wafer to wafer as conditions alter in the various chemical processes they undergo. It has been known for wind conditions outside the fab to alter the performance of chips made inside.
When running designs through a simulator, engineers will analyse the behaviour of the chip on different 'corners'. One comparison is between fast and slow corners. "On the slowest corner, delay can vary by only 20 per cent and the device will fail," Flautner explains.
A knock-on effect lies in power consumption. Flautner sees power as potentially on the increase from one process to the next because old techniques to reduce it have hit the wall. "One of the main things it impacts are the set of variables used in design and the methodologies that are deployed. And they are influenced by variability and uncertainty in the process.
"Normally, we shoot for a certain performance at a certain power budget. And then we overdrive it for higher performance. You can get significant improvements in power consumption if you are willing to reduce the clock frequency. But what if I don't want to stay on that curve? What can I do?" Flautner asks.
The Razor answer is to drive circuits very close to the point at which they fail, instead of picking a voltage and then adding some margin to account for variation. For example, if you drop the voltage too far, the circuit does not switch before the next clock cycle begins. The register at the end of that logic path contains the wrong answer. Without a safeguard, that error will propagate and probably bring the system to its knees.
Razor watches for errors and tries to catch them before they propagate. A shadow register captures the result that finally emerges from the laggard logic path. If it does not match the result that the main register holds, the processor retries the operation. Conceptually, it's similar to the techniques used in fault-tolerant computers where independent circuits compare results and take action if one of the execution units is out of line.
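The shadow-register idea can be sketched as a small behavioural model. This is an illustration of the mechanism as described here, not ARM's circuit. The function names, the timing numbers and the assumption that a retry gives the path one further clock cycle to settle are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Capture:
    main_value: int    # what the main register sampled at the clock edge
    shadow_value: int  # what the shadow register sampled slightly later
    error: bool        # True when the two registers disagree

def razor_capture(correct, stale, path_delay, clock_period, shadow_delay):
    """Model one clock cycle of a register paired with a delayed shadow copy.

    A logic path that has not settled by the sampling instant delivers the
    stale value left over from the previous cycle.
    """
    main = correct if path_delay <= clock_period else stale
    shadow = correct if path_delay <= clock_period + shadow_delay else stale
    return Capture(main, shadow, error=(main != shadow))

def run_with_retry(correct, stale, path_delay, clock_period, shadow_delay):
    """Return (value, cycles_used), retrying once when a mismatch is flagged."""
    first = razor_capture(correct, stale, path_delay, clock_period, shadow_delay)
    if not first.error:
        return first.main_value, 1
    # Assumption: the retry lets the path settle for one more full cycle.
    second = razor_capture(correct, stale, path_delay - clock_period,
                           clock_period, shadow_delay)
    return second.main_value, 2

if __name__ == "__main__":
    # Times are in arbitrary units; the clock period is 1.0.
    print(run_with_retry(correct=7, stale=3, path_delay=0.9,
                         clock_period=1.0, shadow_delay=0.5))
    print(run_with_retry(correct=7, stale=3, path_delay=1.2,
                         clock_period=1.0, shadow_delay=0.5))
```

The model has a blind spot: a path slower than the clock period plus the shadow delay leaves both registers holding the stale value, so no mismatch is raised. In this sketch the shadow window has to cover the slowest path for errors to be caught.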
"We are trying to extend speculative execution to timing," says Flautner, referring to the techniques used in superscalar processors to run code that might not be needed in order to find ways to do things in parallel. And, like speculative execution, sometimes the processor has to back up and start again. Do that too often and you start to miss deadlines. So the key to Razor is to expect to make mistakes: just not too many. That is where ARM is putting much of its effort: how much extra circuitry is needed to trap errors; how fast can it react; how many errors is too many?
Although the approach can reduce the voltage needed to drive a circuit, which cuts its power consumption, the extra circuitry adds its own overhead. "You can end up with very high power overheads. You can often end up spending more effort on detection than you can claw back on margins. Some early implementations had 80 per cent power overhead. Now we are close to 5 to 10 per cent and, in some cases, it is in the noise. But you have to pay attention to it."
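A back-of-envelope calculation shows how detection overhead can wipe out the gain. It relies on the standard approximation that dynamic power in CMOS logic scales with the square of the supply voltage at a fixed clock frequency. The 1.0V and 0.9V supply values are assumptions chosen for illustration. Only the overhead percentages come from the quote.

```python
def relative_power(v_new, v_old, overhead):
    """Dynamic power after lowering the supply, relative to the original design.

    overhead is the extra power drawn by the error-detection circuitry,
    expressed as a fraction of the power of the logic it protects.
    """
    return (v_new / v_old) ** 2 * (1.0 + overhead)

if __name__ == "__main__":
    # Assumed: trimming a 0.1V margin from a 1.0V supply.
    for overhead in (0.80, 0.10, 0.05):
        ratio = relative_power(0.9, 1.0, overhead)
        print(f"overhead {overhead:.0%}: power is {ratio:.2f}x the original")
```

Under these assumptions an 80 per cent overhead leaves the chip drawing more power than before, while a 5 per cent overhead still yields a saving of about 15 per cent.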
ARM built a test chip last year to try out Razor on real software: it was an ARM processor fitted with the extra detection circuitry. "The concept is simple but it cuts across so many areas that getting something out of it is no foregone conclusion.
"We can tolerate or even eliminate margins. We think the clock frequency increase is about 40 per cent and we think we can do 20 per cent better than the state-of-the-art designs," Flautner claims. "We are doubling the energy efficiency of our devices doing this."
But there are big variables in how quickly the core circuits make mistakes. "The point of first failure is significantly impacted by the temperature and the code you are running. We are trying to get very quick reaction times. But you still want a very low error rate and not try to push past that. We have explored techniques where we have pushed far beyond the limit. But the key is to find the edge and stay close to it."
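Finding the edge and staying close to it amounts to a feedback loop. The sketch below is a minimal illustration of that idea, not ARM's control scheme. It lowers the supply while the observed error rate is under a target and raises it when retries become too frequent. The step size, voltage limits, target rate and the toy error model are all hypothetical.

```python
def next_voltage(voltage, errors, cycles, target_rate,
                 step=0.01, v_min=0.6, v_max=1.2):
    """Choose the supply voltage for the next interval from the observed error rate."""
    rate = errors / cycles
    if rate > target_rate:
        voltage += step  # too many retries: back away from the edge
    else:
        voltage -= step  # errors are rare: creep closer to the edge
    return min(v_max, max(v_min, voltage))

def toy_error_rate(voltage, edge=0.85):
    """Hypothetical chip: no errors above the edge, rising error rate below it."""
    if voltage >= edge:
        return 0.0
    return min(1.0, (edge - voltage) * 10.0)

def settle(start_voltage, intervals, cycles=10_000, target_rate=0.01):
    """Run the controller against the toy chip and return the voltage history."""
    voltage, history = start_voltage, []
    for _ in range(intervals):
        errors = round(toy_error_rate(voltage) * cycles)
        voltage = next_voltage(voltage, errors, cycles, target_rate)
        history.append(voltage)
    return history

if __name__ == "__main__":
    trace = settle(start_voltage=1.0, intervals=40)
    print([f"{v:.2f}" for v in trace[-6:]])
```

Because the point of first failure moves with temperature and workload, a real controller would need to re-measure continuously. In this toy model the edge is fixed, so the voltage simply hovers within one step of it.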
Other adaption techniques may be less aggressive than those being worked on by ARM but could be used more widely in the next generation of silicon, following on from work by companies such as Intel, which put some adaption circuits into its Itanium processors several years ago.
"Adaptivity and robustness techniques are making their way from academia into industry," says Professor Andrew Kahng of the University of California at San Diego.
"By 32nm, I think adaption techniques will be necessary," says Andrew Appleby, physical design technical lead at NXP Semiconductors.
Sharad Saxena, a fellow at process-yield specialist PDF Solutions, agrees: "The work at higher levels on variation-sensitive circuits has been somewhat limited. But we need variation-aware circuits."