Data storage technology rethought

Storage technology is getting a rethink. It might seem straightforward - but it isn't. Alongside the pros and cons skirmishing between hard-disk and solid-state drives, new challenges are coming from data-intensive applications and the need to analyse massive data sets.

The biggest single recent change in enterprise storage has been the growth in use of flash memory both on its own and in solid state drives (SSDs), but it has not always delivered fully on its promise. This is partly due to a mistaken belief that SSDs are inevitably more reliable than HDDs, but storage vendors are also culpable through over-hyping the products and failing to convey accurately how and where they should be deployed.

Most vendors have now woken up to these mistakes and this has driven innovation in techniques for organising hierarchical storage tiers so as to make the best use of both HDDs and SSDs to optimise the balance between capital cost, density, performance, and – increasingly importantly – energy consumption.

Enterprise storage requirements, meanwhile, have been evolving and adding new demands to storage architectures, with the rise of Big Data among the most prominent trends. Big Data has not suddenly sprung upon an unsuspecting IT function but has emerged incrementally through growing demand for intelligence derived from multiple sources of both structured and unstructured data.

Also, Big Data means that lots of data of different types has to be assimilated across the storage hierarchy, such that it is all readily available for high-speed analytics processing when required. This trend itself is creating demand for optimised storage hierarchies that enable rapid processing of large amounts of data from diverse sources.

The key point to note, whether talking about Big Data specifically or enterprise storage in general, is that SSDs should be deployed judiciously, and not ubiquitously. On the one hand, SSD should not be used indiscriminately even where high performance and low latency are required; on the other, SSD, or at any rate flash storage as cache, can have a place throughout the hierarchy, including the lower layers towards archiving.

'The advent of Big Data means lots of high-capacity, low-dollar storage that can be migrated to fast disk and in-memory databases to analyse as necessary,' explains Blair Parkhill, vice president at X-IO Technologies, a vendor of integrated SSD/HDD storage systems incorporating software for optimising organisation and data placement within hierarchies. 'The advent of Hadoop for the cataloguing of all the input data is now helping.'

Hadoop is open-source software designed to distribute data across commodity storage clusters, with failures detected and handled at the application layer. It has become almost synonymous with Big Data, but not because it magically enables powerful analytics to occur. What it does do is help provide the underlying flexibility, robustness and performance at low cost by enabling use of commodity servers, with the ability to add these on demand, irrespective of what vendor they come from, or of what operating system they support – at least, that is the theory.

As Parkhill points out, Big Data does not itself call for greater use of SSDs, with organisation and optimisation being critical. 'Going all-SSD is overkill in almost all instances, because Big Data is about analytics, and analytics of the relevant data at that time,' he argues. 'We believe enterprise HDD or a hybrid of a small percentage of enterprise SSD and enterprise HDD gives the best IO density required at the lowest price.'

It is worth remembering, though, that Big Data is a work in progress rather than a challenge that has now pretty much been solved, with further innovations in storage organisation still required. This is the view at HP (Hewlett-Packard), which acquired the UK software company Autonomy in October 2011 for £7.1bn to pursue its strategy for Big Data analytics. This acquisition became embroiled in controversy when HP subsequently claimed that it had been the victim of irregular accounting practices, and became engaged in public slanging with one of Autonomy's founders, Mike Lynch.

This inevitably disrupted HP's assimilation of Autonomy technology, but it has nonetheless come out with IDOL (Intelligent Data Operating Layer), designed to extract meaningful information from various unstructured data sources, such as email, social media, audio and even video, as a prelude for analytics. Among other things it integrates with data structured using Hadoop so that it will fit well into many emerging storage hierarchies optimised for Big Data analytics.

Chris Johnson, vice president and general manager for HP Storage, admits, however, that there was demand for yet higher levels of performance around Big Data, and that Autonomy technology would play a big role in meeting this requirement. 'Big Data requires huge information resources that can be intelligently exploited for business advantage,' Johnson declares. 'Combining storage technology with Autonomy could be an interesting innovation in support of Big Data for the future. Watch this space.'

HP's idea here is to build into storage systems the processing that turns unstructured data into a form ready for analysis, which would greatly speed up analytics applications and could be of benefit where near-instantaneous decisions need to be taken in response to changing events in the field. This could, for example, enable online or TV adverts to be targeted to individuals on mobile handsets on the basis of their immediate activities or location, rather than relying largely on historical analysis of known preferences, as tends to happen at present.

While sophisticated data extraction has yet to be incorporated in storage systems, low-level processing, such as de-duplication to eliminate redundant data, already is. Specialist storage vendor Pure Storage performs in-line de-duplication alongside data compression to cut data volumes before writing to its solid-state-based FlashArray. The company claims this can easily reduce the amount of data that actually has to be written by a factor of five, cutting the cost of flash storage.
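As a rough illustration of how in-line de-duplication and compression reduce what is written, here is a minimal Python sketch. It is not Pure Storage's method: the fixed 4KB block size, the SHA-256 fingerprints, the zlib compression and all function names are assumptions chosen for simplicity.

```python
import hashlib
import zlib


def dedupe_and_compress(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once, compressed.

    Returns (store, recipe): store maps a block fingerprint to its compressed
    bytes; recipe is the ordered list of fingerprints needed to rebuild the data.
    Fixed-size blocks and SHA-256 are illustrative assumptions.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # Only previously unseen content is compressed and "written".
            store[fingerprint] = zlib.compress(block)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the block store and the recipe."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)


def reduction_ratio(data: bytes, store) -> float:
    """Logical bytes divided by the bytes actually held in the store."""
    stored = sum(len(compressed) for compressed in store.values())
    return len(data) / stored if stored else float("inf")
```

The ratio achieved depends entirely on how repetitive and compressible the incoming data is, so a vendor's quoted reduction factor should be checked against a sample of the real workload.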

Data de-duplication also has a role to play further down the storage hierarchy even though costs per bit are lower there, according to HP's Johnson. Furthermore, costs can still be saved lower down through use of tape-based technology. 'The storage of data throughout its life needs to incorporate high-performance and high-cost media, but also de-duplication to low-cost back-up appliances and further onto tape, which still has a role to play,' says Johnson.

Many data centres will therefore have three basic grades of storage – flash/SSD, HDD, and tape – with sub-divisions between them; and the challenge is to optimise the balance between the three to ensure that performance targets are met without overspending. This takes the storage strategy up to a new level of technological complexity.

Until now there has been a tendency within major IT projects to over-provision storage to cover up for inevitable shortcomings in the management of the tiers. The process of storage provisioning is complex in any case, having to meet targets for capacity, performance, cost and disaster recovery, which can conflict with each other. This has meant that more storage capacity than is needed tends to be deployed, and also too much of the higher-grade, more expensive units, such as SSDs, as already noted by X-IO's Parkhill.

Software management trend

Remedies are emerging under the cloak of the 'Software-Defined Data Centre' (SDDC) proposals, with the overall objective of building on virtualisation to separate management and provisioning completely from the underlying hardware. SDDC is an architectural approach to IT infrastructure that extends virtualisation concepts (such as abstraction, pooling and automation) to all of a data centre's resources and services, aiming to achieve 'IT as a service'.

The overarching objective here is to enable capacity to be added as the demand requires, for individual projects or load increases, by dropping in storage, networking or processing units separately. Applied to storage, where Hadoop is playing a role, the principle is the same in that it should be possible to add SSDs, HDDs or tape systems, separately as needed, from any hardware vendor. However, in the practice of IT, principles are not the same as practicalities.

There is an irony here in that the term 'Software-Defined Data Centre' was largely coined by virtualisation market leader VMware, now owned by storage vendor EMC, which has a vested interest in encouraging sales of its own hardware. This at first seemed to have led EMC to de-emphasise the value of software-defined storage because it would enable its customers to incorporate systems from other vendors; but the company has now realised that the game is up and that the data centre world is moving inexorably towards commoditised hardware under the umbrella of virtualised multi-tier management.

Against this background, where storage, computation and networking are separated within the SDDC, each of the three has to pull its weight in meeting overall demands, which are increasing all the time. While CPU performance continues to keep pace with Moore's Law, and network bandwidth has expanded at a similar rate both internally and over the wide area (through increased deployment of fibre among other things), storage systems have tended to lag behind.

In terms of capacity, HDDs have kept up quite well even given the much-mooted Big Data boom, but they are falling behind in access speed, as was pointed out by Laurence James, products, solutions and alliances manager at storage and data management company NetApp. This is where flash storage comes in, with NetApp specialising in deploying this throughout the storage tier to ensure that performance targets are met, while avoiding overspending on it. Flash memory can be deployed as cache memory in front of HDDs right across the data centre, but particularly where high performance is needed, to ensure that read and write times are not just sufficiently fast but also consistent, which is just as important.

'With the introduction of flash technologies, intelligent caching is key to ensuring the active data resides in the most appropriate tier,' says James. 'Automation is a must-have feature here and NetApp have a portfolio of flash-based products designed to optimise workload performance at the Server, Storage Controller, and Disk Array. For those workloads that require consistent low latency and response times, such as OLTP (Online Transaction Processing), all Flash Arrays such as NetApp EF540 are increasingly in demand.'
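The idea of flash acting as a cache in front of HDDs can be sketched in a few lines of Python. This is a simplified illustration, not NetApp's caching logic: the least-recently-used (LRU) eviction policy, the class name `FlashReadCache` and the dictionary standing in for the disk tier are all assumptions.

```python
from collections import OrderedDict


class FlashReadCache:
    """A small read cache (standing in for flash) in front of a slower store (HDD).

    Uses least-recently-used eviction, an illustrative choice; real products
    use their own, more sophisticated placement policies.
    """

    def __init__(self, capacity_blocks, backing_store):
        self._capacity = capacity_blocks
        self._backing = backing_store      # hypothetical HDD tier: block id -> bytes
        self._cache = OrderedDict()        # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self._cache:
            # Hit: serve from the fast tier and mark the block as recently used.
            self._cache.move_to_end(block_id)
            self.hits += 1
            return self._cache[block_id]
        # Miss: fetch from the slow tier, then keep a copy in the fast tier.
        self.misses += 1
        data = self._backing[block_id]
        self._cache[block_id] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used block
        return data
```

Tracking hits and misses in this way gives a crude view of whether the active data actually fits in the fast tier, which is the question automated tiering has to answer continuously.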

As James points out, different grades of flash are now available at varying price points, performance levels and lifespans. This needs to be taken into account when evaluating flash-based options, given that the big advantage of flash is not just lower latency but consistent performance and a predictable lifespan. HDD failures occur more randomly, with a probability that increases with age and use, while flash endurance is much more predictable, although it does vary between the different grades.

'Flash has a much-improved failure predictability than older mechanical hard disk technologies,' James says. 'The challenge is that, depending on which type of flash is deployed, each has a defined endurance related to the number of program/erase (P/E) cycles per cell. Beyond this number of P/E cycles the cells become unreliable.' Where durability is the main requirement, SLC (single-level cell) flash might be preferred, while eMLC (enterprise multi-level cell) flash would be chosen where cost and capacity are more important.
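Because endurance is tied to a defined number of P/E cycles, the expected life of a flash device can be estimated with simple arithmetic. The Python sketch below is a back-of-the-envelope model, not a vendor formula: it assumes writes are spread evenly across all cells, and the function name, the write-amplification parameter and any figures fed into it are hypothetical.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, daily_writes_gb,
                             write_amplification=1.0):
    """Estimate how long a flash device lasts before its P/E budget is used up.

    Assumes perfectly even wear across all cells (an idealisation).
    write_amplification is the ratio of data physically written to flash
    to data written by the host; 1.0 means no amplification.
    """
    if daily_writes_gb <= 0 or write_amplification <= 0:
        raise ValueError("daily_writes_gb and write_amplification must be positive")
    # Total host data that can be written before the cells reach their rated cycles.
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / daily_writes_gb / 365
```

Running the same workload figures against the rated cycle counts of different flash grades shows how the choice between durability and cost per gigabyte translates into replacement intervals.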

Storage preferences are governed by differing application requirements, which may have to be balanced within a shared SDDC. OLTP, as already noted, demands very low latency for large numbers of transactions involving individually small amounts of data, for which the flash option is best suited. But this may not be the case for, say, a TV broadcaster playing out video via a scheduled service. Although delay must be minimised, access latency will not be an issue if the video data is played out sequentially, as is also noted by X-IO's Blair Parkhill: 'SSD is not for video as it is highly sequential. When architected right, you can get a lot of performance from striping SATA drives – and a good caching algorithm,' he says. The situation is more complex for on-demand video accessed by people at different times, with support for rewind and pause as well, when flash has a role to play keeping latency down – but even then most of the data can reside on suitable HDDs, which is important to contain costs, given the huge size of high-definition video files.

'Yes, maybe flash will be used for caching, but in general UPS (Uninterruptible Power Supply) backed up RAM in servers, along with good enterprise SATA drives or regular large capacity enterprise drives, work best for streaming video,' says Parkhill. 'Servers with adequate RAM attached to good, dense, reliable storage that can handle many video streams at once allow for high reliability and are the key to the growth in media.'
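Striping across drives, which Parkhill mentions for sequential video, helps because consecutive blocks land on different drives and can be read in parallel. A minimal Python sketch of the address mapping follows; it assumes simple round-robin striping with no parity, and the function name and parameters are hypothetical rather than taken from any X-IO product.

```python
def locate_block(logical_block, num_drives, stripe_blocks=1):
    """Map a logical block number to (drive index, block number on that drive).

    Assumes plain round-robin striping without parity: each stripe unit of
    stripe_blocks consecutive blocks goes to the next drive in turn.
    """
    if num_drives < 1 or stripe_blocks < 1:
        raise ValueError("num_drives and stripe_blocks must be at least 1")
    stripe_index, offset_in_stripe = divmod(logical_block, stripe_blocks)
    drive = stripe_index % num_drives
    drive_block = (stripe_index // num_drives) * stripe_blocks + offset_in_stripe
    return drive, drive_block
```

A long sequential read therefore touches every drive in turn, so aggregate throughput scales roughly with the number of drives even though each individual SATA drive is comparatively slow.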

Given that video is accounting for an ever-increasing proportion of the ICT world's digital data, this suggests that, contrary to some predictions, HDDs are in little danger of losing out to SSDs in pure volume terms. At the same time, however, SSD will be the critical point for many high-performance applications, including parts of the video distribution chain, and will be the main focus of continuing research and development, as with the Memristor development programme that is currently being undertaken by HP Labs.
