vol 7, issue 8

WAN - chipping away at long-distance data

20 August 2012
By Philip Hunter
WAN accelerators enable large CAD-CAM data transfers between development sites around the world

Cisco’s GTS shapes traffic by reducing outbound traffic flow to avoid congestion

Parallel processing algorithms were originally designed for supercomputing applications like seismic exploration

Hitachi’s WAN Accelerator: application-level speed-up

Qing Li, Blue Coat: where there’s a way...

Wide-area network optimisers use smart ways to gain more value from WAN links through new ICs.

The dedicated wide-area network (WAN) optimisation device is alive but not particularly well, judging by the emerging strategies of leading vendors. They are all busy leapfrogging each other to be seen as pioneers of a new communications software age, in which all the critical processes involved in speeding up the transmission of enterprise application data over the WAN run on generic processors such as the Intel Core i7.

This may not seem a good start for a review of integrated circuit (IC) developments in WAN optimisation, but the link between the optimisation processes and the underlying silicon in such devices is compelling, especially as demand for these products is growing. It is just that, increasingly, that silicon is not what has traditionally been regarded as dedicated: an ASIC (Application Specific IC), an FPGA (Field Programmable Gate Array) or a System on Chip (SoC).

The distinction between generic processors and dedicated ASICs has been blurring ever since the first floating point accelerator was incorporated in early integrated circuits in the mid to late 1970s. Now the so-called generic chips used in PCs, tablets and smartphones, for instance, incorporate an array of graphics, video, maths, security and network processing functions that at some point in their history were performed by bespoke chips. This multi-functionality and huge increases in raw power are the factors that have turned WAN optimisation vendors away from dedicated silicon, as Qing Li, chief scientist at Blue Coat Systems, one of the field's main vendors, indicated.

"These generic Intel processors are almost becoming giant SoCs (System on Chips) themselves," says Li. The term SoC is a slight misnomer, as no system is 'purely a chip', and there will always be some subsidiary or additional components such as flash memory or a hard-drive; but it is useful in that it highlights the degree of convergence between generic and dedicated silicon.

First came the ASIC, which initially comprised purely custom logic designed to minimise surface area and therefore cost, while reducing energy consumption and maximising performance, because signals have less distance to travel when executing the task than on a generic processor. However, an ASIC could not be reused, so its development cost had to be weighed against the likely production volume and time to obsolescence. The FPGA then evolved as a compromise, allowing some field upgradeability by reconfiguring the connections between the chip's logic components; this is relevant for tasks where it is possible to foresee in broad terms how a standard or process will evolve. The SoC goes further by integrating multiple components of a complete self-contained system (such as a mobile phone) on a single chip substrate, combining the ASIC's advantages of low manufacturing cost and energy efficiency with some programmability.

One trend uniting ASICs, FPGAs and SoCs is the increased use of common logic components, slightly confusingly known as IP cores ('confusingly' because the 'IP' refers to Intellectual Property rather than the Internet Protocol, although such cores often implement Internet Protocol functions as well). This means that, for a small sacrifice in size and performance, an ASIC can be built substantially from common components that reduce its development cost. The distinction between, say, a dedicated SoC and a generic processor then comes down to the custom logic that remains in the former, while some of the SoC's IP cores may also be more specialised than could be justified for the latter.

When it comes to WAN optimisation there is still a role for dedicated silicon and, as before, this tends to be at the cutting edge of speed, with 10Gbit/s and then 40Gbit/s Ethernet transmission being current examples. The difference now is that dedicated hardware is no longer generally required for the cutting edge in terms of functionality. That is why there has been a flight to software by vendors in the field, who to an extent now prefer to call their products WAN accelerators rather than WAN optimisers.

But WAN optimisation vendors - as we shall continue calling them for consistency - still need to be aware of the hardware architecture at a higher level, given that processor performance is now being stepped up by integrating multiple cores into a chip, rather than increasing the clock rate, which would consume more power and generate greater amounts of heat. This makes little difference for typical desktop productivity applications where the multiple cores make it easy to support multitasking, with processes assigned to different cores.

Some tasks, though, require the combined power of multiple cores, and these include WAN acceleration across gigabit networks; but to achieve this the task has to be split up so that it can run in parallel. If one task has to finish before the next one begins, the whole process can only exploit a single core. So, as Blue Coat Systems' Li pointed out, WAN optimisation is reviving the field of parallel processing, in which an application is divided into multiple components that can be executed independently of each other.

"Sometimes you have to completely redesign the software stack to take advantage of multi cores," explains Li. "So there is renewed interest in old parallel processing algorithms." Originally such algorithms were designed to exploit multiple chips within large-scale architectures, as an alternative to single-processor supercomputers, for high-performance computing applications such as weather forecasting and seismic exploration. Now the same principles are being applied to WAN acceleration, with suitable modifications, for multiple cores within a single chip (see the box 'Blue Coat revives parallel processing' for a glimpse of how this has been done for Blue Coat's own ProxySG WAN accelerator).

Data reduction and virtualisation

The ability to exploit multicore architectures to increase WAN performance is one area where vendors are seeking to differentiate themselves at present. Another major area that has grown in importance recently is data reduction and de-duplication. This is a big field where compression plays a huge role, particularly for video, where the bit-rate required to deliver full high-definition pictures to a large screen TV is only 0.5 per cent of that consumed by the original pictures coming off the camera.
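The scale of that reduction can be checked with rough arithmetic. The camera parameters below are assumptions, not figures from the article: 1920x1080 video at 60 frames per second with 4:2:2 chroma sampling at 10 bits per sample, which averages 20 bits per pixel. Applying the article's 0.5 per cent figure to that raw rate gives the delivered bit rate.

```python
def raw_bitrate_bps(width: int, height: int, fps: float, bits_per_pixel: float) -> float:
    """Uncompressed video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel


# Assumed camera output: 1920x1080, 60 frames/s, 4:2:2 sampling at 10 bits.
# 4:2:2 carries one luma sample per pixel plus two chroma samples shared by
# each pair of pixels, i.e. 2 samples x 10 bits = 20 bits per pixel on average.
RAW_BPS = raw_bitrate_bps(1920, 1080, 60, 20)

DELIVERED_FRACTION = 0.005  # the article's "0.5 per cent"
delivered_bps = RAW_BPS * DELIVERED_FRACTION

print(f"raw {RAW_BPS / 1e9:.2f} Gbit/s, delivered {delivered_bps / 1e6:.1f} Mbit/s")
```

Under these assumptions the raw stream is about 2.5 Gbit/s and the delivered stream about 12 Mbit/s, a compression ratio of roughly 200:1.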

For enterprises the big drivers for data reduction are virtualisation, remote disaster recovery and cloud computing, which are accelerating the growth of network traffic and creating much larger pipes, requiring more sophisticated and higher-performance acceleration techniques. As far as enterprises were concerned, WAN optimisation as a generic technology was first conceived in the 1990s largely to deal with relatively low-volume branch-office traffic; it now has to address large links, increasing both the need for and the benefits of substantial data reduction.

"Robust data reduction is where the separation between different vendors starts," says Donato Buccella, CTO at Certeon, one of the leading WAN acceleration vendors, which, like the others, has recently made the transition from a hardware to a software focus.

One of the new twists brought by the growth in virtualisation and cloud computing is network de-duplication, which aims to avoid, as far as possible, sending the same data more than once over a given wide-area link. It involves storing the same content at each end of the link, trading disk capacity for bandwidth, and checking data before transmission to see whether it has already been sent. If it has, the system at the receiving end is simply told that it already holds the required content and can replay it locally.
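The de-duplication exchange just described can be sketched with two small classes. This is a simplified illustration under stated assumptions: data is cut into fixed 4 KiB chunks, each identified by its SHA-256 digest, and both ends keep their stores in memory. Commercial products typically use content-defined (variable-size) chunking and persistent disk or flash stores, and the class and function names here are hypothetical.

```python
import hashlib

CHUNK = 4096  # assumed fixed chunk size, for illustration only


class DedupSender:
    def __init__(self) -> None:
        self.sent: set[bytes] = set()  # digests the far end is known to hold

    def encode(self, data: bytes) -> list[tuple[str, bytes]]:
        messages = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.sent:
                messages.append(("ref", digest))  # 32-byte reference, not the data
            else:
                self.sent.add(digest)
                messages.append(("raw", chunk))   # first sighting: send the bytes
        return messages


class DedupReceiver:
    def __init__(self) -> None:
        self.store: dict[bytes, bytes] = {}  # digest -> chunk, mirrors the sender

    def decode(self, messages: list[tuple[str, bytes]]) -> bytes:
        parts = []
        for kind, payload in messages:
            if kind == "raw":
                self.store[hashlib.sha256(payload).digest()] = payload
                parts.append(payload)
            else:
                parts.append(self.store[payload])  # replay from the local copy
        return b"".join(parts)


def wire_bytes(messages: list[tuple[str, bytes]]) -> int:
    """Payload bytes that actually cross the WAN (framing ignored)."""
    return sum(len(payload) for _, payload in messages)
```

Sending a 1 MiB file whose chunks are all distinct costs the full file the first time, but only 256 references of 32 bytes (8 KiB) if it is sent again. The sketch assumes the receiver never evicts chunks; a real system has to keep both stores consistent.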

This does not necessarily involve dedicated processing, but handling large volumes of data at wire speed increasingly calls for solid-state (aka flash) storage rather than hard-disk drives. "The way WAN optimisation uses disks is akin to very high transaction rate databases," notes Buccella. "For high-end implementations we recommend SSD configurations." In practice this usually means Intel SSDs, according to Buccella, because they are better suited to high-end applications. "Most other vendors specialise in small devices for laptops."
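Buccella's comparison with high-transaction-rate databases can be made concrete with a back-of-envelope bound. If every chunk lookup costs one random read from the de-duplication store, throughput cannot exceed the store's random I/O rate multiplied by the chunk size. The device figures below (about 200 random IOPS for a fast hard drive, about 50,000 for an SSD) and the 8 KiB chunk are illustrative assumptions, not numbers from the article.

```python
def max_dedup_throughput_mbps(iops: float, chunk_bytes: int, ios_per_chunk: float = 1.0) -> float:
    """Upper bound on de-duplicated throughput in Mbit/s when each chunk
    needs `ios_per_chunk` random reads from the store."""
    chunks_per_second = iops / ios_per_chunk
    return chunks_per_second * chunk_bytes * 8 / 1e6


CHUNK_BYTES = 8 * 1024  # assumed chunk size

for device, iops in [("hard drive (assumed)", 200), ("SSD (assumed)", 50_000)]:
    print(f"{device}: ~{max_dedup_throughput_mbps(iops, CHUNK_BYTES):,.0f} Mbit/s")
```

On these assumptions a hard-drive store caps out at roughly 13 Mbit/s while the SSD allows over 3 Gbit/s, which is why random-read performance, rather than capacity, decides whether a store can keep up with a fast link.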

Another big trend in WAN optimisation that has driven the change towards a software-based approach is virtualisation, which by definition requires products that are 'hardware agnostic', as the aim is to separate applications from the underlying platform. "We are seeing a strong demand for virtual WAN optimisation solutions that are easy to deploy on various hardware devices," reports Jeff Aaron, vice president of marketing at Silver Peak, another vendor in this market. "Our Virtual Acceleration Open Architecture (VXOA) was designed to be completely hardware-independent, enabling it to run on any hypervisor for maximum flexibility."

As with Blue Coat's ProxySG, Silver Peak still has to take account of the fact that the hardware will often comprise multicore chips and may also have multiple processors. "VXOA [does have] the ability to leverage multiple underlying multicore processors," Aaron says.
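One widely used way for network software to spread work across cores, while keeping each TCP connection's packets in order, is to hash every packet's flow identifier (the 5-tuple of addresses, ports and protocol) to a core. Network cards implementing receive-side scaling do this with a Toeplitz hash; the sketch below substitutes CRC-32 for simplicity. It illustrates the general technique only, with hypothetical function names; neither vendor quoted here has said this is how its product works.

```python
import zlib
from collections import defaultdict


def flow_to_core(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 proto: str, n_cores: int) -> int:
    """Map a flow's 5-tuple to a core index; identical tuples always map alike."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_cores


def dispatch(packets: list[tuple], n_cores: int) -> dict[int, list[tuple]]:
    """Group packets (5-tuple first, then payload fields) into per-core queues.
    Packets of one flow go to one queue, in their arrival order."""
    queues: dict[int, list[tuple]] = defaultdict(list)
    for pkt in packets:
        queues[flow_to_core(*pkt[:5], n_cores)].append(pkt)
    return queues
```

Because all packets of a flow land on the same core, per-connection state such as sequence tracking or a compression dictionary needs no locking, while different connections proceed in parallel.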

Latency issues

One other big change in the WAN optimisation scene is the growing importance of latency. This has long been critical for broadcasters, pay-TV operators and, increasingly, telcos (telecommunications companies) when delivering video, where there is usually no point in retransmitting a dropped packet because the video frame it belongs to has already been played. But with data centres becoming distributed across multiple sites, and reliant on fast network communications for real-time or near-real-time processes other than video, latency has risen up the WAN optimisation agenda. Some of the functions involved in latency reduction are now done in software, including various protocol acceleration methods such as 'read-ahead' and 'write-behind'.

These are used for protocols such as CIFS (Common Internet File System), formerly known as SMB (Server Message Block) dating back to the early days of PCs in the 1980s, where again memory is traded for bandwidth. In this case the aim is to anticipate data that may be required later, and transmit it at a time when there is plenty of bandwidth, or in batches to reduce requests and acknowledgements.
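The read-ahead idea can be illustrated by counting round trips. In this hypothetical model, `RemoteFile` stands in for a file server at the far end of the WAN and `ReadAheadProxy` for the local accelerator: on a cache miss the proxy fetches the requested block plus the following blocks in a single request, on the assumption that the client is reading sequentially. Real CIFS/SMB acceleration is considerably more involved; it must respect file locking and cache coherence, for example.

```python
class RemoteFile:
    """Hypothetical far-end file; every call costs one WAN round trip."""

    def __init__(self, n_blocks: int) -> None:
        self.n_blocks = n_blocks
        self.round_trips = 0

    def read_blocks(self, start: int, count: int) -> dict[int, bytes]:
        self.round_trips += 1
        end = min(start + count, self.n_blocks)
        return {i: f"block-{i}".encode() for i in range(start, end)}


class ReadAheadProxy:
    """Hypothetical local accelerator that prefetches `window` blocks per miss."""

    def __init__(self, remote: RemoteFile, window: int) -> None:
        self.remote = remote
        self.window = window
        self.cache: dict[int, bytes] = {}

    def read(self, block: int) -> bytes:
        if block not in self.cache:
            # Miss: fetch this block and the next window-1 in one round trip.
            self.cache.update(self.remote.read_blocks(block, self.window))
        return self.cache[block]


def sequential_read(n_blocks: int, window: int) -> int:
    """Read a whole file block by block; return the WAN round trips used."""
    remote = RemoteFile(n_blocks)
    proxy = ReadAheadProxy(remote, window)
    for b in range(n_blocks):
        proxy.read(b)
    return remote.round_trips
```

With an assumed 100 ms round trip, reading 64 blocks one at a time takes 64 round trips (about 6.4 s), whereas an eight-block read-ahead window needs 8 (about 0.8 s). Write-behind works the other way round: the accelerator acknowledges writes locally and forwards them in batches.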

One latency tool that does still use dedicated silicon in some cases is Forward Error Correction (FEC). The immediate impact of FEC is to slow traffic down by inserting extra bits to buffer against signal loss; but the effect is to reduce the number of IP packets that need retransmitting because of errors, and therefore to cut overall latency. FEC is still a research topic as algorithms continue to be refined; but, whatever technique is used, dedicated hardware can be required for high-speed networking. FEC requires wire-speed operation, because it operates on the whole packet stream in real time.
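The simplest packet-level FEC scheme shows the trade-off described above. For every group of k data packets the sender adds one parity packet, the bytewise XOR of the group; this costs 1/k extra bandwidth but lets the receiver rebuild any single lost packet in the group without waiting for a retransmission. This is an illustrative sketch assuming equal-length packets (real schemes pad them) and hypothetical function names, not a description of any vendor's FEC; production systems often use stronger codes, such as Reed-Solomon, that can repair several losses per group.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must be padded to equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(group: list[bytes]) -> bytes:
    """Parity packet sent alongside the k data packets of a group."""
    return reduce(xor_bytes, group)


def recover(received: list[Optional[bytes]], parity: bytes) -> list[Optional[bytes]]:
    """Rebuild at most one missing packet (marked None) from the parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can repair only one loss per group")
    repaired = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the parity with every surviving packet leaves the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired
```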

More generally, dedicated hardware will continue to be required at the cutting-edge of speed, as Certeon's Buccella agrees. "The role of silicon is to trailblaze for the next iteration," says Buccella. "This means 10Gb, and also last mile interfaces tend to be silicon based, but the application layer will be software-based."

Over time, meanwhile, it looks like the distinction between dedicated and general hardware will be at the level of the board that goes into a rack rather than at the silicon level, as generic cores and components continue to proliferate. One example of this trend is the recently-launched Hitachi WAN Accelerator.

Further information

Faster WAN terminology: optimisation becomes acceleration

Specialist vendors renamed their trade 'WAN acceleration' because the functions they used to provide under the banner of optimisation have been appropriated, and commoditised, by general-purpose routers made by Cisco and others. These functions include generic traffic shaping, which deliberately slows down IP packets belonging to lower-priority applications to preserve bandwidth for critical processes such as real-time video, and TCP offload, which is the reading of TCP headers and processing of the protocol stack for error checking and retransmission, an overhead proportional to the network bit rate. TCP offload is still sometimes done in dedicated silicon, but routers generally use boards comprising general-purpose multicore chips, even if a board is dedicated to tasks that used to be the preserve of WAN optimisation appliances. The 'WAN acceleration' functions left to specialist vendors are those at the application level, operating on data at the file rather than the packet level, including compression and data reduction or de-duplication. At this level multicore processors are quite fast enough, with performance depending more on the surrounding infrastructure, according to Oliver Braekow, product manager for the NG Firewall product at optimisation vendor Barracuda.
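Traffic shaping of the kind described above is commonly expressed in terms of a token bucket: tokens accumulate at the permitted rate up to a burst limit, and a packet may leave only when enough tokens are available; otherwise it is held back rather than dropped. The function below is a minimal sketch of a FIFO token-bucket shaper for one traffic class, with a hypothetical name; the rate and burst values in the example are arbitrary assumptions, and it is not Cisco's implementation.

```python
def shape(packets: list[tuple[float, int]], rate_bps: float, burst_bits: int) -> list[float]:
    """Return departure times for (arrival_time_s, size_bits) packets, given
    in arrival order, passing through a FIFO token-bucket shaper."""
    tokens = float(burst_bits)  # bucket starts full
    clock = 0.0                 # time at which `tokens` was last updated
    departures = []
    for arrival, size in packets:
        if size > burst_bits:
            raise ValueError("packet larger than the bucket can ever hold")
        t = max(arrival, clock)  # FIFO: cannot leave before the previous packet
        tokens = min(burst_bits, tokens + (t - clock) * rate_bps)
        if tokens < size:
            t += (size - tokens) / rate_bps  # hold the packet until tokens accrue
            tokens = float(size)
        tokens -= size
        clock = t
        departures.append(t)
    return departures
```

For example, with an assumed 1 Mbit/s rate and a burst of one 1,500-byte packet (12,000 bits), three such packets arriving together leave at 0 ms, 12 ms and 24 ms: the low-priority class is smoothed to its allotted rate, leaving the rest of the link for critical traffic.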

"Architectures based on dozens of multi-core CPUs deliver performance in abundance and may be used flexibly for various purposes," Braekow says. "What we found in our testing on WAN optimisation performance is that the underlying storage requirements for the data to be cached dictate the maximum performance to a much larger extent than the raw CPU or system bus bandwidth."

Protocol stack management: Blue Coat revives parallel processing

Blue Coat Systems, and particularly its chief scientist Qing Li, have become pioneers of parallel processing for multicore chips, reviving and adapting old methods originally developed for supercomputers comprising large numbers of parallel processors. The fundamental challenge is the same: to break a process down into many components, each of which can be executed independently in parallel and then reassembled, either at the end or at various intermediate stages, without any component having to wait for a result computed by another. Li has been involved in redesigning the routing for the open-source FreeBSD operating system. FreeBSD is a UNIX derivative, although it is not allowed to use that name, and it is widely used as the basis of operating systems by vendors including Juniper and even Apple, as well as by Blue Coat itself.
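The decomposition described above, in which independent components run in parallel and are then reassembled, can be illustrated on a typical WAN-acceleration task. This is a minimal sketch, not Blue Coat's design: it assumes a payload can be cut into fixed-size chunks (the 64 KiB `CHUNK_SIZE` and the function names are arbitrary, hypothetical choices) that are compressed independently with `zlib` on a thread pool and reassembled in order. CPython's `zlib` can release the interpreter lock while compressing large buffers, so the threads can overlap on separate cores.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # assumed chunk size, for illustration only


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the payload into pieces that share no state with each other."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_chunk(chunk: bytes) -> bytes:
    # Each chunk is compressed on its own, so no worker waits for another.
    return zlib.compress(chunk, 6)


def compress_parallel(data: bytes, workers: int = 4) -> list[bytes]:
    chunks = split_into_chunks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, which takes care of reassembly.
        return list(pool.map(compress_chunk, chunks))


def decompress_all(parts: list[bytes]) -> bytes:
    return b"".join(zlib.decompress(p) for p in parts)


if __name__ == "__main__":
    payload = b"CAD assembly, revision 7; " * 40_000
    parts = compress_parallel(payload)
    assert decompress_all(parts) == payload
    print(len(payload), "bytes ->", sum(map(len, parts)), "bytes in", len(parts), "chunks")
```

The price of independence is that redundancy spanning two chunks goes undetected, so larger chunks compress better but parallelise less finely; balancing granularity against efficiency is the same problem the old supercomputing algorithms had to manage.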

The main task Li faced was to separate the processing of layer two and layer three within the TCP/IP protocol stack. Layer two deals with individual links, while layer three embraces end-to-end paths between sources and destinations across a network, so in principle the two should be independent. After all, a major reason for defining a layered protocol stack, such as the seven-layer OSI reference model, was so that each layer could be processed independently of the others.

However, layer-two and layer-three processing in TCP/IP implementations have become entwined, ironically to increase efficiency; but now that we have multicore chips, performance can be increased by severing these connections again, which is what Li has aimed to do at Blue Coat. Previously, layer-two and layer-three processing ran as a single thread, so the tasks had to proceed sequentially; now they can exploit multiple cores in parallel. Performance improves even on single-core processors, but the gain is greater on multicore chips.
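The change Li describes, from one thread doing layer-two and layer-three work in sequence to separate stages that can run concurrently, can be sketched as a two-stage pipeline joined by a queue. The stage bodies below are hypothetical placeholders (stripping a 14-byte Ethernet header, then reading the IPv4 destination address), and this is not Blue Coat's FreeBSD code. In CPython the global interpreter lock stops pure-Python stages from running truly in parallel, so the sketch shows the structure rather than the speed-up, which a compiled implementation with one stage per core could realise.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the frame stream


def layer2_stage(frames: list[bytes], out_q: "queue.Queue[object]") -> None:
    for frame in frames:
        out_q.put(frame[14:])  # placeholder link-layer work: drop Ethernet header
    out_q.put(DONE)


def layer3_stage(in_q: "queue.Queue[object]", results: list[str]) -> None:
    while True:
        packet = in_q.get()
        if packet is DONE:
            break
        # Placeholder network-layer work: IPv4 destination is header bytes 16-19.
        results.append(".".join(str(b) for b in packet[16:20]))


def run_pipeline(frames: list[bytes]) -> list[str]:
    q: "queue.Queue[object]" = queue.Queue(maxsize=1024)  # bounded hand-off
    results: list[str] = []
    stages = [
        threading.Thread(target=layer2_stage, args=(frames, q)),
        threading.Thread(target=layer3_stage, args=(q, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

The bounded queue lets the layer-two stage work ahead on new frames while layer three is still busy, instead of each frame passing through both layers before the next one is touched.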

Product focus: Cloud could lead switch to software

As if to prove there is still life in the old WAN appliance box, Hitachi has come up with a modern version with its WAN Accelerator. Even this might disappoint advocates of dedicated silicon, for it is implemented entirely in software on standard network boards, but it is at least dedicated to WAN optimisation in the cloud computing era.

According to Hirofumi Masukawa, director of network solution second operation in Hitachi's telecommunication & network systems division, it is a response to the demand for "frequently updated big data". He cites the example of a Hitachi customer, the global automotive manufacturer Honda, which needs to transfer large 3D CAD data sets between continents as fast as possible, day by day.

Hitachi believes that the trend towards increased use of caching, data compression and application prioritisation does not suit all applications, many of which still require a focus on throughput at the network level. Masukawa therefore argues that there is a need for a generic box that enterprises can plug in to work alongside different application-level acceleration techniques, boosting performance further.
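Throughput at the network level over intercontinental distances is constrained by a standard relationship: a single TCP session can have at most one receive window of data in flight per round trip, so its throughput is bounded by window size divided by round-trip time, and keeping a link full requires a window at least as large as the bandwidth-delay product. The 150 ms round trip and 1 Gbit/s link below are assumed values for a long intercontinental path, the function names are hypothetical, and Hitachi has not said which techniques its accelerator uses.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Ceiling for one TCP session: one window per round trip."""
    return window_bytes * 8 / rtt_s


def bandwidth_delay_product_bytes(link_bps: float, rtt_s: float) -> float:
    """Window needed to keep a link of this speed and latency full."""
    return link_bps * rtt_s / 8


RTT_S = 0.150           # assumed intercontinental round-trip time
LINK_BPS = 1e9          # assumed 1 Gbit/s WAN link
LEGACY_WINDOW = 65_535  # largest TCP window without the window-scale option

print(f"~64 KiB window: {max_tcp_throughput_bps(LEGACY_WINDOW, RTT_S) / 1e6:.1f} Mbit/s")
print(f"window needed to fill the link: {bandwidth_delay_product_bytes(LINK_BPS, RTT_S) / 1e6:.2f} MB")
```

On these assumptions an unscaled window limits a session to about 3.5 Mbit/s, while filling the link would need a window of about 18.75 MB; the window-scale option and session-level tuning exist to close that gap.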

"The Hitachi WAN Accelerator accelerates the session that is specifically picked up," says Masukawa. "We focus on optimisation of TCP, but not the application, because applications constantly evolve. For example, the new version of Microsoft CIFS has recently been optimised so that it can transfer data blocks in parallel, to fit the cloud computing environment."
