In our 'Cover Your Ass' culture we are storing more unwanted data than ever, and it's costing a bundle – so what steps can we take to generate less and delete more?
Whoever identified the only certainties in life as death and taxes evidently lived in a pre-digital age, where another incontrovertible truth is that we are all creating and storing more data every year. In this digital age data can be 'archived' en masse and on the cheap – and it's turning us into a generation of pathological hoarders. The problem is getting worse, both because we are generating and retaining more and bigger data, and because we have less time and inclination to remediate the situation.
Research firm IDC estimates that 15 petabytes of new (perhaps 'additional' would be a better description) information is generated every day; that 988 exabytes of data were generated in 2010 alone; and that demand for data storage capacity worldwide will grow at an average rate of almost 50 per cent a year through to 2014. Just how much of this data do we actually need to keep, and are we deleting it once it has served its purpose? Michael Robinson, vice president of marketing communications at storage hardware company GreenBytes, describes current attitudes to data retention, in both business and personal environments, as those of the 'Cover Your Ass generation'.
'We have become inherent hoarders of data – as long as you have a file why not keep it just in case,' Robinson says. 'You are not likely to delete something in case you might need it at some point, but 99 times out of 100, you don't need it.'
Why does everyone feel obliged to store all this data? 'It's like mining for gold, there is a lot of dirt and rock, but people think if I look through it again maybe they'll find something,' thinks Arvind Krishna, general manager of information management at IBM. 'And with more and more mobile devices connected to the digital network, as well as data generated by sensors in machine-to-machine communications applications, the volume of data is increasing exponentially.'
Tony Reid, UK technical director at Hitachi Data Systems (HDS), estimates that 80 per cent of the data each of us generates ends up in a corporate network somewhere, and that the concept of capacity planning has become 'a thing of the past'.
'Another compelling factor fuelling that growth is server virtualisation – when you provide server virtualisation to a business unit, and they discover they can provision a virtual machine themselves along with network and storage within seconds, they tend to do it even for trivial things just because they can,' Reid reports. 'Data lifecycle management was important and commonplace eight or nine years ago, but it failed because it relied on business units to make decisions about the data they owned, and that was difficult if not impossible for them to do. Now they rely on the infrastructure to make those decisions for them, if they need more performance or capacity, for example.'
Users 'cannot be trusted'
Individual end-users could reduce this volume of data themselves, but it seems they can rarely be trusted to take proactive measures. Steve Blayze is the IT director at UK financial services firm Smith & Williamson. The company backs up about 15TB of data a week, and Blayze says few users delete anything from their mailbox or personal file spaces, where volumes can range from just a few megabytes to multiple gigabytes. In the ten-plus years he has worked with the company, it has never once deleted any data, retaining it just in case its end-users might need it in the future.
'We try to push them to go through their email boxes and personal file stores, but we struggle to get them to do it unless we can get sponsorship from higher up the management chain,' admits Blayze. 'Otherwise they say they don't want to do it, they might need it, or they are too busy.'
In many cases, company employees at all levels are loath to delete email or other files because they worry they will be required to produce them at a later date for compliance, auditing, or legal e-discovery purposes; or in support of some internal investigation charged with establishing culpability in some matter.
'Companies should have data retention policies, but they don't, or they ignore them because they worry about compliance, so it is very rare for people to delete data, not PST mail files or anything, going back ten years or more,' said Reid. 'As part of the acquisition process you see legal teams going through a company's old documents, financial records, and compliance audits, for example, looking to examine the last ten years' worth of information supporting that company's business.'
The comparison with the storage and retention procedures of the past is apposite: once documents and records no longer had to be retained from a legal standpoint they were disposed of – to make space for more recent paperwork, if nothing else. Hard copies of records that did have to be retained over the longer term were stored off site – at a price that meant anything that did not absolutely have to be kept was weeded out.
Efficient technology = less data?
It is inevitable that the larger part of the responsibility for storing data more efficiently falls to enterprise IT departments; and while there seems little doubt that they will continue to retain large volumes of data on a 'just in case' basis, there are technologies and processes to help minimise the mass of information being retained, hopefully at less cost to their employers' pockets and to the environment.
'People don't delete things, and they cannot be told to do so. So they have a knee-jerk reaction, and put it in the data centre,' said Robinson. 'But that cannot go on without some intelligent storage management – cutting down on the number of spinning hard-disks, and not requiring as many SANs to store equivalent amounts of data, cutting down on cooling costs.'
'If we believe information is a capital asset we have to look at its lifecycle – how we discover what we have, then store it, archive it, delete it, and so on,' argues Arvind Krishna at IBM. 'But you absolutely cannot do it by using human support – humans cannot think about everything, so you need a policy to classify data automatically; a system that says this type of data needs to be kept but after seven years it can be thrown away, or which every month gets rid of something and replaces it with new stuff, or keeps only the last three years of data.' Information lifecycle management (ILM) and hierarchical storage management (HSM) software helps companies set policies which regularly move data off expensive, power-hungry hard-disks and onto cheaper storage formats like tape cartridges or other removable media. These tools often incorporate some form of information classification management or master data management process that identifies what the information is, attaches metadata for future identification purposes, then indexes it to help staff manage it more effectively within data warehouses.
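The kind of automated, policy-driven retention Krishna describes can be sketched in a few lines. The data classes, retention windows, and extension-based classifier below are purely illustrative assumptions, not any vendor's scheme; a real ILM product would classify on content and metadata, not file extensions.

```python
import os
import time

# Hypothetical retention policy: maximum age in days per data class.
RETENTION_DAYS = {
    "logs": 90,
    "email_archive": 7 * 365,   # 'after seven years it can be thrown away'
    "reports": 3 * 365,         # 'keeps only the last three years of data'
}

def classify(path):
    """Crude automatic classification by file extension (an assumption)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".log":
        return "logs"
    if ext == ".pst":
        return "email_archive"
    return "reports"

def expired(path, now=None):
    """True if the file has outlived its class's retention window."""
    now = now or time.time()
    age_days = (now - os.path.getmtime(path)) / 86400
    return age_days > RETENTION_DAYS[classify(path)]
```

A scheduled job could then sweep a file store monthly, archiving or deleting whatever `expired` flags – the 'every month gets rid of something' pattern Krishna mentions.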
Using more powerful analytics software to identify the different types of data traversing wide-area and local-area networks, as well as the Internet, can also help IT departments make more intelligent decisions about whether to keep that information before it is even copied to the hard-disk in the first place.
'We only tend to analyse data that is at rest and not the data in motion. It is critical that we analyse data at the front end and not the back end when we archive it. The preference is not to have to store any data at all – look at the data as it is being streamed and decide whether you don't need it, identify copies or changes, and throw the rest away,' reckons IBM's Arvind Krishna, who acknowledges that these analytical processes handle some types of information better than others.
'Numerical data is easy, text data processing has seen huge advances over the last few years and there are lots of things you can do there, whilst speech can be turned into text and analysed, for example,' he continues, 'but it is not good for everything; facial recognition technology [for image and video processing] represents the boundary of what can be done.'
With company backups one of the biggest contributors to the mountains of waste data being created, regular testing of backup systems can flush out any information sets which are being copied and recopied to multiple destinations unnecessarily.
HDS's Tony Reid says: 'You need to make sure the governance and compliance around those files is set properly so when it is destroyed it is not actually sitting on a server somewhere. You also need to make sure that, whilst corporate servers can scale to accommodate content, they are not backing up that data every night.'
Robinson highlights data deduplication as one technology to help reduce data volume, combined with compression and caching technology to further shrink the amount of information backed up to the storage arrays. Deduplication works by storing only a single, rather than multiple, copies of the same file, with other locations being assigned a pointer which directs the application to the location of the master copy. 'Results depend on the type of file being backed up – a great deal of compression can be achieved with Microsoft Word files and email messages for example, but JPG image files are harder to break down.'
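The single-copy-plus-pointer scheme described above can be sketched as a small in-memory store. This is an illustrative toy, not GreenBytes' implementation: it deduplicates at fixed block granularity and uses a content hash as the 'pointer' to the master copy.

```python
import hashlib

class DedupStore:
    """Single-instance storage: each unique block is kept once; every
    other occurrence is just a pointer (its hash) to the master copy."""

    def __init__(self):
        self.blocks = {}   # hash -> block data (the single master copies)
        self.files = {}    # filename -> list of block hashes (pointers)

    def put(self, name, data, block_size=4096):
        pointers = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # stored once, ever
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name):
        """Follow the pointers to reassemble the original file."""
        return b"".join(self.blocks[h] for h in self.files[name])
```

Storing a second, identical file adds only a new pointer list – no new blocks – which is why duplicate-heavy backup sets shrink so dramatically, while already-compressed formats like JPG gain little.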
Outsourcing storage to hosted cloud services may have a positive effect on how much data is kept or deleted. Leasing storage infrastructure on a pay-as-you-store basis means that the more you store, the more you are billed: this should force departments to take a much closer look at the information submitted for backup in a bid to reduce it – and the bill – accordingly.
'Some companies complain that backing up media files takes hours, but in my view they should not be backing up that data at all,' says Tony Reid at HDS. 'I suspect most of that data could sit on a content cloud; just add metadata to it so you know what it is.'
'Does the enterprise delete data? The vast majority of organisations fail to even store it in the first place, but [storage] technology can go on [innovating] for a long time, as Moore's Law tells us. A lot of that data may end up being stored more efficiently in the cloud, where more post processing will help collect and discard it,' concludes Steve Prentice, research analyst at Gartner.