ARM goes multicore for faster embedded 3D
ARM has developed a high-end version of its Mali graphics processor that uses multiple processor cores to deliver a performance improvement of up to four times over its predecessors.
The company has exploited its tiled architecture to let multiple processors work on the same frame without the large rise in power consumption that much more frequent writes to memory would otherwise cause. It has increased the speed of the vertex processor, which produces the basic wireframe triangles used to render a scene, and multiplied the number of what it calls fragment processors, which fill in those triangles with shaded textures.
Chris Porthouse, senior product manager for ARM, said: “With the vertex processing, we have more or less tripled the performance over the Mali 200 for high-definition-screen gaming. The other change in performance is in terms of the pixel processing. You can deploy more fragment processors, and provide a four-times performance improvement over the [existing] Mali.”
The Mali 200 has a single vertex processor and one fragment processor. In the multicore design, the fragment processors work on different parts of the screen in an extension of ARM’s tiled rendering scheme. The company worked out that one of the major sources of power consumption in 3D graphics is the energy needed to write data out to an external framebuffer when using so-called “immediate-mode rendering”.
Remi Pedersen, product manager for the Mali core, explained: “A typical immediate-mode renderer works across the full framebuffer. It can produce pixels anywhere in the buffer at any time. The memory has to be very fast and with very low latency and you are touching the same pixels many times over. You have multiple accesses to the same pixels over and over again.
“Another drawback is that large framebuffers show unpredictable behaviour,” Pedersen added. In a multiprocessor environment, cache coherency is a problem and can lead to dramatic changes in rendering speed if cores compete over areas of the buffer that have been cached.
With ARM’s approach to tile-based rendering, the vertex processor produces a sorted list of triangle coordinates that are ‘binned’ into 16x16 tiles. Each fragment processor then takes on a tile and renders all the pixels in it into a small area of on-chip memory. Only when that process is finished are the blocks written out to the actual framebuffer. Each fragment processor can work on its own tile without interference. Pedersen said ARM has tried different tile sizes but found that 16x16 arrays provide the most cost-effective dimensions. Pixels may still be written multiple times within a tile (Pedersen said he sees the technique as a combination of tiled and immediate-mode rendering), but the energy cost per write is much lower because the tile is so small and held on-chip. “The tile size has been kept fixed through all the different generations of the Mali core.”
“Each pixel in main memory is only touched once,” Pedersen added.
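The difference Pedersen describes can be illustrated with a small software model. The Python sketch below is not ARM's implementation: the function names, the bounding-box binning, the edge-function coverage test and the write counters are all assumptions made for illustration, and it ignores depth testing, shading and the hardware's actual tile write-out format. It bins triangles into 16x16 tiles, renders each tile into a scratch buffer standing in for on-chip memory, and counts how often the external framebuffer is written compared with a naive immediate-mode loop.

```python
TILE = 16  # tile edge in pixels, matching the 16x16 figure quoted in the article


def edge(a, b, p):
    """Signed-area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri, x, y):
    """True if the centre of pixel (x, y) is inside the triangle (either winding)."""
    p = (x + 0.5, y + 0.5)
    w = [edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(v >= 0 for v in w) or all(v <= 0 for v in w)


def bin_triangles(tris, width, height):
    """Map each tile (tx, ty) to the triangles whose bounding box overlaps it."""
    bins = {}
    for i, tri in enumerate(tris):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        x0 = max(int(min(xs)) // TILE, 0)
        y0 = max(int(min(ys)) // TILE, 0)
        x1 = min(int(max(xs)) // TILE, (width - 1) // TILE)
        y1 = min(int(max(ys)) // TILE, (height - 1) // TILE)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins


def render_tiled(tris, width, height):
    """Render tile by tile; each covered pixel reaches the framebuffer once."""
    framebuffer = [[None] * width for _ in range(height)]
    external_writes = 0
    onchip_writes = 0
    for (tx, ty), ids in bin_triangles(tris, width, height).items():
        tile = {}  # stand-in for the small on-chip tile buffer
        for i in ids:  # submission order is preserved within the tile
            for y in range(ty * TILE, min((ty + 1) * TILE, height)):
                for x in range(tx * TILE, min((tx + 1) * TILE, width)):
                    if covers(tris[i], x, y):
                        tile[(x, y)] = i  # overdraw stays on-chip
                        onchip_writes += 1
        for (x, y), colour in tile.items():  # single write-out of the finished tile
            framebuffer[y][x] = colour
            external_writes += 1
    return framebuffer, external_writes, onchip_writes


def render_immediate(tris, width, height):
    """Naive immediate mode: every covered pixel is written straight to the framebuffer."""
    framebuffer = [[None] * width for _ in range(height)]
    external_writes = 0
    for i, tri in enumerate(tris):
        for y in range(height):
            for x in range(width):
                if covers(tri, x, y):
                    framebuffer[y][x] = i
                    external_writes += 1
    return framebuffer, external_writes
```

With two overlapping triangles, both functions produce the same image, but the tiled version writes each covered pixel to the external framebuffer once, while the overdraw is absorbed by the per-tile scratch buffer. Because tiles are independent, each entry in the bin map could in principle be handed to a separate fragment processor.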
Porthouse said a version of the core with two fragment processors takes up 9mm² on a chip made on a 65nm process. Each additional fragment processor in the configurable core adds another 3.5mm².
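As a rough worked example of those figures, the sketch below estimates die area for a given number of fragment processors. It assumes the area scales linearly from the quoted two-processor baseline on the same 65nm process. The function and constant names are hypothetical, and the article gives no figure for a configuration with fewer than two fragment processors, so that case is rejected.

```python
BASE_AREA_MM2 = 9.0           # quoted area with two fragment processors (65nm)
BASE_FRAGMENT_PROCESSORS = 2  # configuration the quoted baseline refers to
EXTRA_AREA_MM2 = 3.5          # quoted area per additional fragment processor


def estimated_area_mm2(fragment_processors):
    """Estimate core area in mm², assuming linear scaling from the quoted figures."""
    if fragment_processors < BASE_FRAGMENT_PROCESSORS:
        raise ValueError("no area figure was quoted below two fragment processors")
    extra = fragment_processors - BASE_FRAGMENT_PROCESSORS
    return BASE_AREA_MM2 + extra * EXTRA_AREA_MM2
```

Under these assumptions, a configuration with four fragment processors would come to 9 + 2 × 3.5 = 16mm².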
The company sees applications for the Mali 400 in high-end handsets, set-top boxes and portable gaming systems. “Gaming is driving the wave at the high end,” said Porthouse.
ARM aims to exploit the growing demand for 3D graphics in portable consumer devices