To try to keep up with demands for cheaper access to high-density logic, programmable-logic companies are planning radical shifts in how they design their chips.
Invented 25 years ago, the field-programmable gate array (FPGA) threatened to reshape the world of chip design. When Xilinx introduced the first device, its capacity fell way short of that of a true custom chip, generally implemented using standard-cell design. And although FPGAs carried no non-recurring engineering (NRE) charges for items such as mask creation, application-specific integrated circuits (ASICs) and metal-programmed gate arrays were not yet so expensive that you needed an FPGA to get a project off the ground.
But NRE and design costs ramped up during the 1990s and became eye-wateringly expensive in the following decade. FPGAs gradually looked more attractive and an ideal substrate for bringing together cores from different sources with custom logic - a reprogrammable mash-up device. Yet, from 2000 onwards, the FPGA market has done little but tread water. The FPGA makers don't make much more money than they did ten years ago.
The number of ASIC design starts has declined but the lion's share of the income has gone to application-specific standard products (ASSPs), which provide a fixed set of functions aimed at a certain kind of system, rather than the more flexible FPGA. The reason? The FPGA remains, for many volume systems, too expensive and power-hungry.
The problem for the FPGA, however, is volume cost. Even in the high-value systems that the FPGA companies excel at supporting, a rapid rise in complexity threatens the viability of even that model. But the FPGA suppliers are ready to exploit more radical techniques to at least get back on the cost curve, if not improve on it. For Altera, the number-two player in FPGAs, the next generation of devices won't be quite as programmable as the last - although you will be able to flip some of the logic inside them as they run.
For the generation of FPGAs to be made on a 28nm process, Altera is making several changes. One is to merge HardCopy - the mask-programmed gate array used to provide fixed versions of customer designs for less money per die - with the mainstream Stratix family of programmable-logic parts.
Altera is staking the future of Stratix on a resurgent communications business. 'What is driving the most increased need for processing is video. YouTube last year had more traffic than the entire Internet in 2000,' says David Greenfield, senior director of product marketing for high-end FPGAs at Altera.
'Our customers are now looking for 400Gb/s systems and a move from 3G wireless to 4G and LTE.'
To get an FPGA to cope with 400Gb/s, you need a lot of parallel processing which, according to Greenfield, will drive migration to denser 28nm devices. 'You need something in the 320K logic-element range to do 100Gb/s.'
A back-of-the-envelope calculation implies a fourfold increase in density to get to 400Gb/s with no increase in clock speed. 'We can't do it at 40nm. The biggest reticle-busting die we have at 40nm is 530K logic elements.'
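That back-of-the-envelope calculation is easy to reproduce using only the figures quoted above:

```python
# Check of Greenfield's scaling argument, using only the figures quoted
# in the article.

les_per_100g = 320_000      # logic elements needed for 100Gb/s
baseline_gbps = 100
target_gbps = 400

# With no clock-speed increase, extra throughput must come from parallelism,
# so the logic-element count scales linearly with line rate.
les_needed = les_per_100g * target_gbps // baseline_gbps
print(les_needed)           # 1,280,000 logic elements - the fourfold increase

biggest_40nm_die = 530_000  # Altera's largest ('reticle-busting') 40nm device
print(les_needed / biggest_40nm_die)  # roughly 2.4x bigger than anything 40nm offers
```

Even the largest 40nm part falls well short of the projected requirement, which is the case for moving to 28nm.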
Even 28nm falls a bit short, not just on density but on power. 'We used to get a 20 per cent reduction in core voltage with each generation,' says Greenfield. 'It will probably drop from 0.9 to 0.8V at 28nm. That reduction isn't going to buy us a lot in terms of active power. Plus you have higher leakage power to contend with.'
So, by moving to 400Gb/s transmission, wireline communications vendors would get almost no power reduction, which is not good news when institutions such as Bell Labs are clamouring for energy savings in Internet infrastructure equipment.
One answer is to bake as much as you can into hard logic. That is where HardCopy comes in. 'Embedded HardCopy blocks will be a new way to improve density,' says Greenfield.
Although less dense than standard-cell ASIC circuitry - by about a factor of two, according to Greenfield - the HardCopy gate-array sections are far denser than FPGA fabric and, because they use a fraction of the transistors, consume less power. Even for its own hard cores, Altera is beginning to favour HardCopy because it is cheaper to design with.
'We have used hard blocks in our architecture for more than ten years, implementing them using standard-cell technology. What we are doing in this generation is using HardCopy functions to get there. It lets us deliver different flavours of the products to customers that fit the needs of their applications.
'If there is a way for us to drop the cost of doing product variants, then it allows us to do more and provide more market-specific products,' Greenfield adds.
FPGAs made on the 28nm process will have strips of logic array designed for HardCopy alongside a larger fabric of fully programmable logic. Initially, Altera will use the HardCopy area for its own market-specific products. 'But it is envisioned that it will be available for customers to use as well,' says Greenfield.
Because of the tenfold difference in density, you won't need a lot of die area set aside for HardCopy to implement a full FPGA's worth of circuitry. And keeping down the area needed for reconfigurability is where the partial reconfiguration comes in.
Another change is the decision finally to adopt partial reconfiguration so that parts of the logic can be switched in and out while the rest of the system is still running.
'Partial reconfiguration has been around for a long time,' says Greenfield. 'But this is a fairly significant shift for Altera.'
Altera's argument is that it can make partial reconfiguration more tractable for designers through the incremental compilation supported by its Quartus design tool. Essentially, blocks that are to be switched in and out are compiled and then fixed as sub-chips so that when changes are made, any rerouting is carried out inside the chunk of FPGA set aside for them.
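A conceptual model of that flow - not the Quartus API, just an illustration with invented names - shows why fixing each swappable block to its own region keeps the process tractable: only the reserved region is rewritten, and the static logic around it keeps running.

```python
# Conceptual model of partial reconfiguration with incremental compilation.
# All names here are hypothetical illustrations, not real Quartus constructs.

class Region:
    """A chunk of FPGA fabric reserved for one swappable block.

    Because the block was compiled against this fixed region, any rerouting
    caused by a change is confined inside it."""
    def __init__(self, name):
        self.name = name
        self.bitstream = None   # partial bitstream currently loaded

    def load(self, partial_bitstream):
        # Only this region is rewritten; the rest of the device keeps running.
        self.bitstream = partial_bitstream

class Device:
    def __init__(self):
        self.static_logic = "static-design"       # compiled once, then frozen
        self.regions = {"filter": Region("filter")}

    def swap(self, region_name, partial_bitstream):
        self.regions[region_name].load(partial_bitstream)

dev = Device()
dev.swap("filter", "fir-64tap")    # switch one block in at run time...
dev.swap("filter", "fir-128tap")   # ...and replace it later
print(dev.regions["filter"].bitstream)  # fir-128tap
```

The design choice the sketch captures is the separation of concerns: the static design and each region's candidate blocks are compiled independently, so a change to one block never forces a recompile or re-route of the whole chip.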
Although Xilinx has had partial reconfiguration for longer, the market leader in FPGAs says it has worked to make the process easier and to address the problem of increasing power consumption in high-end communications designs 'by reducing the number of gates that are used at any one time,' says Chuck Tralka, Xilinx's senior director for product definition.
'We have customers who use the partial reconfiguration to manage the I/O terminations on-chip. I/O power is a fairly significant portion of most customers' power budgets. This makes it possible to manage the type of termination dynamically,' Tralka explains.
'The biggest improvements in partial reconfiguration are relative to the toolsuite,' Tralka says. 'It is a fairly complex technology to adopt because users now have a time dimension that they have to manage.'
Focusing reconfiguration on improving density, Tabula decided the best way to handle the tools problem was to hide the time dimension entirely. President and CTO Steve Teig is conscious of the bodies of the dead companies that have tried before to attack reconfigurable computing - and the FPGA market itself.
Teig reckons Tabula holds two advantages over reconfigurable-computing predecessors such as Chameleon and Quicksilver, which disappeared more or less without trace: 'We can reconfigure thousands of times faster than anyone who preceded us. The second part is that we have chosen to hide the revolution.'
One problem with reconfigurable machines up to now has been that the cost of changing logic on the fly is pretty high: it is too hard to get the gigabytes of data into a device to sustain any kind of throughput. For example, the architecture put together by Stretch was superficially interesting but hobbled by the speed at which you could shovel data into the device.
What Teig calls the 'spacetime' architecture is different because the memory used to store the chip's state is held close to the actual logic. It also benefits from the way scaling trends have delivered some things, such as raw clock speed, in excess of designers' ability to use them fully. And active power continues to creep down, albeit more slowly than before, while static power consumption inexorably moves up. As static power is largely proportional to the number of idle transistors on a device, a high ratio of active to idle transistors is good. And being able to get the same amount of logic out of fewer transistors is also good.
'I believe spacetime is fundamental to the speed of computation,' Teig claims. 'For any technology that you might use to do computation it becomes cheaper to do reconfiguration locally than to send the signal somewhere else for computation. We've been trying to look 25 years ahead with this approach.
'At a certain point most devices will be spacetime because it will take less energy to reconfigure than to send the signal far away. This is the next wave of computing strategy,' he adds.
Tabula divides a raw clock signal of more than 1GHz into multiple sub-cycles, which it calls folds. On each sub-cycle, the logic elements and muxes read their state out of local memory, perform the computation and then move on to the next sub-cycle. Signals still in flight at the end of each sub-cycle are caught by transparent latches within the programmable-interconnect section and held until the next sub-cycle that needs those signals as inputs. Any logic behind the latch, once it's closed, can be reused by independent logic on subsequent sub-cycles.
The transparent latch - a 'time via' in Tabula-speak - looks to the hardware description language (HDL) code like a buffer. As just about every chip-design tool on the market can insert buffers without affecting the HDL code, this avoids the need to pipeline the design to allow long sequences of combinatorial logic to be 'folded' onto a constantly reconfiguring collection of logic elements - which is why Tabula calls the sub-cycles folds.
Teig himself concedes that trying to think of the architecture as constantly reconfiguring is very difficult. It's much easier to think of the folds as the route to a virtual 3D chip: each fold is a layer of logic connected by the time vias. So the initial parts, made on a 40nm process at TSMC, provide eight layers of logic using just one physical surface.
Why does this make life easier for the tools writers? Because a 3D place-and-route algorithm is not conceptually different from a conventional 2D one. You can use cost functions similar to those used in place-and-route tools that deal with an entirely physical set of logic elements and wires. A time via that connects fold one with fold eight has a longer 'length' than one that joins two adjacent folds, and it blocks more of the virtual interconnect. So optimisation tools can attempt to minimise that cost to get better overall chip utilisation.
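The cost function being described might take a form like the sketch below - an assumed shape for illustration, not Tabula's actual tool code: ordinary 2D wirelength plus a penalty proportional to how many folds a time via spans.

```python
# Sketch of a 3D place-and-route cost function of the kind described above
# (an assumed form for illustration; the weight of 2.0 is invented).

def wire_cost(src, dst, via_weight=2.0):
    """Cost of a connection between (x, y, fold) coordinates.

    The fold axis is the 'virtual z': a time via spanning more folds has
    a longer effective length and blocks more of the virtual interconnect."""
    (x1, y1, f1), (x2, y2, f2) = src, dst
    manhattan = abs(x1 - x2) + abs(y1 - y2)    # ordinary 2D routing cost
    fold_span = abs(f1 - f2)                   # layers the time via crosses
    return manhattan + via_weight * fold_span

# A via from fold one to fold eight costs more than one between adjacent folds:
print(wire_cost((0, 0, 1), (0, 0, 8)))  # 14.0
print(wire_cost((0, 0, 1), (0, 0, 2)))  # 2.0
```

An optimiser that minimises this total over all connections will naturally keep communicating logic in nearby folds, which is exactly the behaviour a conventional 2D placer exhibits along the x and y axes.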
As Teig explains, it gets much easier to conceptualise once you've disposed of the idea that the architecture is time-slicing: 'For 99 per cent of my work I pretend the chip is 3D: it's much easier to visualise the x, y and z axes rather than trying to visualise what comes into existence when, which just gives you a headache.'