Data management - will we ever press 'delete'?
Organisations are being swamped by terabytes of waste data
IT departments continue to retain large volumes of unstructured data on a ‘just in case’ basis
Arvind Krishna, IBM: "you definitely do need a policy to classify data automatically"
Michael Robinson, GreenBytes: “We have become inherent hoarders of data”
IDC figures indicate that conventional structured data is growing at a 32.3 per cent compound annual growth rate
In our 'Cover Your Ass' culture we are storing more unwanted data than ever, and it's costing a bundle – so what steps can we take to generate less and delete more?
Whoever identified the only certainties in life as death and taxes evidently lived in a pre-digital age. Today another incontrovertible truth is that we are all creating and storing more data every year. In this digital age data can be 'archived' en masse and on the cheap – and it's turning us into a generation of pathological hoarders. The problem is getting worse, both because we are generating and retaining more and bigger data, and because we have less time and inclination to remedy the situation.
Research firm IDC estimates that 15 petabytes of new (perhaps 'additional' would be a better way to describe it) information is generated every day, that 988 exabytes of data was generated in 2010 alone, and that demand for data storage capacity worldwide will grow at an average rate of almost 50 per cent every year through to 2014. Just how much of this data do we actually need to keep, and are we deleting it once it has served its purpose? Michael Robinson, vice president of marketing communications at storage hardware company GreenBytes, sums up current attitudes to data retention, in both business and personal environments, as those of the 'Cover Your Ass generation'.
'We have become inherent hoarders of data – as long as you have a file why not keep it just in case,' Robinson says. 'You are not likely to delete something in case you might need it at some point, but 99 times out of 100, you don't need it.'
Why does everyone feel obliged to store all this data? 'It's like mining for gold, there is a lot of dirt and rock, but people think if I look through it again maybe they'll find something,' says Arvind Krishna, general manager of information management at IBM. 'And with more and more mobile devices connected to the digital network, as well as data generated by sensors in machine-to-machine communications applications, the volume of data is increasing exponentially.'
Tony Reid, UK technical director at Hitachi Data Systems (HDS), estimates that 80 per cent of the data each of us generates ends up in a corporate network somewhere, and that the concept of capacity planning has become 'a thing of the past'.
'Another compelling factor fuelling that growth is server virtualisation – when you provide server virtualisation to a business unit, and they discover they can provision a virtual machine themselves along with network and storage within seconds, they tend to do it even for trivial things just because they can,' Reid reports. 'Data lifecycle management was important and commonplace eight or nine years ago, but it failed because it relied on business units to make decisions about the data they owned and that was difficult if not impossible for them to do. Now they rely on the infrastructure to make those decisions for them, if they need more performance or capacity, for example.'
Users 'cannot be trusted'
Individual end-users could reduce this volume of data themselves, but it seems they can rarely be trusted to take proactive measures. Steve Blayze is the IT director at UK financial services firm Smith & Williamson. The company backs up about 15TB of data a week, and Blayze says few users delete anything from their mailboxes or personal file spaces, where volumes can range from just a few megabytes to multiple gigabytes. In the more than ten years he has worked at the company, it has never once deleted any data, retaining it just in case its end-users might need it in the future.
'We try to push them to go through their email boxes and personal file stores, but we struggle to get them to do it unless we can get sponsorship from higher up the management chain,' admits Blayze. 'Otherwise they say they don't want to do it, they might need it, or they are too busy.'
In many cases, company employees at all levels are loath to delete email or other files because they worry they will be required to produce them at a later date for compliance, auditing, or legal e-discovery purposes, or in support of some internal investigation charged with establishing culpability in some matter.
'Companies should have data retention policies, but they don't, or they ignore them because they worry about compliance, so it is very rare for people to delete data, not PST mail files or anything, going back ten years or more,' said Reid. 'As part of the acquisition process you see legal teams going through a company's old documents, financial records, compliance audits, for example, and look to examine the last ten years' worth of information supporting this company's business.'
The comparison with the storage and retention procedures of the past is apposite: once documents and records no longer had to be retained from a legal standpoint they were disposed of – to make space for more recent paperwork, if nothing else. Hard copies of records that did have to be retained over the longer term were stored off site – at a price that meant that anything that did not absolutely have to be kept was weeded out.
Efficient technology = less data?
It is inevitable that the larger part of the responsibility for storing data more efficiently falls to enterprise IT departments; and while there seems little doubt that they will continue to retain large volumes of data on a 'just in case' basis, there are technologies and processes to help minimise the mass of information being retained, hopefully at less cost to their employers' pockets and to the environment.
'People don't delete things, and they cannot be told to do so. So they have a knee-jerk reaction, and put it in the data centre,' said Robinson. 'But that cannot go on without some intelligent storage management – cutting down on the number of spinning hard-disks, and not requiring as many SANs to store equivalent amounts of data, cutting down on cooling costs.'
'If we believe information is a capital asset we have to look at its lifecycle – how we discover what we have, then store it, archive it, delete it, and so on,' argues Arvind Krishna at IBM. 'But you absolutely cannot do it by using human support – humans cannot think about everything so you need a policy to classify data automatically, a system that says this type of data needs to be kept but after seven years it can be thrown away, or which every month gets rid of something and replaces it with new stuff, or keeps only the last three years of data.'
Information lifecycle management (ILM) and hierarchical storage management (HSM) software help companies set policies that regularly move data off expensive, power-hungry hard disks and onto cheaper storage formats such as tape cartridges or other removable media. These tools often incorporate some form of information classification management or master data management process that finds out what the information is, attaches metadata for future identification purposes, then indexes it to help staff manage it more effectively within data warehouses.
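To make the idea of an automatic retention policy concrete, here is a minimal Python sketch of a rule table that decides whether a record should be kept or deleted from its class and age. The class names, retention periods and the `action_for` function are illustrative assumptions, not taken from any IBM product or from the organisations quoted here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule: how long one class of data is kept."""
    data_class: str
    keep_days: Optional[int]  # None means "keep indefinitely"


# Illustrative policy table; classes and periods are assumptions.
POLICY: Dict[str, RetentionRule] = {
    "financial_record": RetentionRule("financial_record", 7 * 365),
    "sensor_log": RetentionRule("sensor_log", 3 * 365),
    "scratch": RetentionRule("scratch", 30),
    "contract": RetentionRule("contract", None),
}


def action_for(data_class: str, created: date, today: date) -> str:
    """Return 'keep', 'delete' or 'review' for one record."""
    rule = POLICY.get(data_class)
    if rule is None:
        # Unclassified data is flagged for a human rather than deleted.
        return "review"
    if rule.keep_days is None:
        return "keep"
    age_days = (today - created).days
    return "delete" if age_days > rule.keep_days else "keep"
```

A scheduled job could call `action_for` for every indexed record, for example `action_for("financial_record", date(2005, 1, 1), date(2013, 1, 1))`. The design choice worth noting is that unknown classes fall through to a manual review rather than to deletion.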
Using more powerful analytics software to identify the different types of data traversing wide-area and local-area networks, as well as the Internet, can also help IT departments make more intelligent decisions about whether to keep that information before it is even copied to the hard-disk in the first place.
'We only tend to analyse data that is at rest and not the data in motion. It is critical that we analyse data at the front end and not the back end when we archive it. The preference is not to have to store any data at all – look at the data as it is being streamed and decide whether you don't need it, identify copies or changes, and throw the rest away,' reckons IBM's Arvind Krishna, who acknowledges that these analytical processes handle some types of information better than others.
'Numerical data is easy, text data processing has seen huge advances over the last few years and there are lots of things you can do there, whilst speech can be turned into text and analysed, for example,' he continues, 'but it is not good for everything; facial recognition technology [for image and video processing] represents the boundary of what can be done.'
With company backups one of the biggest contributors to the mountains of waste data being created, regular testing of backup systems can flush out any information sets which are being copied and recopied to multiple destinations unnecessarily.
HDS's Tony Reid says: 'You need to make sure the governance and compliance around those files is set properly so when it is destroyed it is not actually sitting on a server somewhere. You also need to make sure that, whilst corporate servers can scale to accommodate content, they are not backing up that data every night.'
Robinson highlights data deduplication as one technology to help reduce data volume, combined with compression and caching technology to further shrink the amount of information backed up to the storage arrays. Deduplication works by storing only a single copy of a file rather than multiple copies, with every other location being assigned a pointer that directs the application to the master copy. 'Results depend on the type of file being backed up – a great deal of compression can be achieved with Microsoft Word files and email messages for example, but JPG image files are harder to break down.'
Outsourcing storage to hosted cloud services may have a positive effect on how much data is kept or deleted. Leasing storage infrastructure on a pay-as-you-store basis means that the more you store, the more you are billed: this will force departments to take a much closer look at the information submitted for backup in a bid to reduce it – and the bill – accordingly.
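The single-copy-plus-pointers idea can be sketched in a few lines of Python. This is a toy, in-memory illustration of file-level deduplication keyed on a content hash. The `DedupStore` class and its method names are hypothetical, and commercial products (including GreenBytes') may work quite differently.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Toy file-level deduplicating store (illustrative only)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}    # content digest -> single master copy
        self.pointers: Dict[str, str] = {}   # file path -> content digest

    def put(self, path: str, content: bytes) -> None:
        """Store content once; later identical files only add a pointer."""
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.pointers[path] = digest

    def get(self, path: str) -> bytes:
        """Follow the pointer for this path back to the master copy."""
        return self.blobs[self.pointers[path]]

    def stored_bytes(self) -> int:
        """Bytes actually held, counting each unique content once."""
        return sum(len(blob) for blob in self.blobs.values())
```

If the same 1,000-byte report is saved under three different paths, `stored_bytes()` grows by 1,000 bytes rather than 3,000, while `get()` still returns the full content for each path.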
'Some companies complain that backing up media files takes hours, but in my view they should not be backing up that data at all,' says Tony Reid at HDS. 'I suspect most of that data could sit on a content cloud and just add metadata to it so you know what it is.'
'Does the enterprise delete data? The vast majority of organisations fail to even store it in the first place but [storage] technology can go on [innovating] for a long time as Moore's Law tells us. A lot of that data may end up being stored more efficiently in the cloud where more post processing will help collect and discard it,' concludes Steve Prentice, research analyst at Gartner.
Further information
- www.ibm.com/uk/en/
- www.hds.com/uk/
- www.smith.williamson.co.uk/
- www.getgreenbytes.com/
- www.idc.com/
- www.gartner.com/technology/home.jsp
- eandt.theiet.org/magazine/2010/08/unstructured-data.cfm
- eandt.theiet.org/magazine/2008/17/generation-0817.cfm?origin=EtOtherStories
- eandt.theiet.org/explore/reports/chip-design/index.cfm
Helping organisations cut 'digital footprint'
Faced with having to store increasing volumes of SQL file data and VMware virtual server images in 2009, Ipswich Borough Council (IBC) took steps to reduce the storage capacity required by its backups. Its tape backups were taking nearly 24 hours to complete, so it installed two ExaGrid disk-based appliances at its data centre and disaster recovery site.
ExaGrid uses a combination of last backup compression (effectively incremental backups) and data deduplication technology to reduce the disk space required by ratios of anything from 10:1 to 50:1, depending on the information being stored, and IBC saw big reductions. ExaGrid uses byte-level deduplication, which stores only the smallest of changes from one backup set to another instead of larger block or file-level alterations.
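As a rough illustration of what storing only byte-level changes between backup sets means, the Python sketch below builds a delta between two versions of a backup using the standard library's `difflib.SequenceMatcher`. It is not ExaGrid's algorithm: the delta format and the function names `make_delta`, `apply_delta` and `delta_payload_size` are assumptions made for this example, and `SequenceMatcher` is far too slow for real backup volumes.

```python
from difflib import SequenceMatcher
from typing import List, Tuple, Union

# A delta is a list of ("copy", start, end) ranges taken from the old
# backup and ("data", bytes) runs that are new in this backup.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def make_delta(old: bytes, new: bytes) -> List[Op]:
    """Describe `new` as copies from `old` plus only the changed bytes."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    delta: List[Op] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))
        # "delete": bytes dropped from old, so nothing needs storing
    return delta


def apply_delta(old: bytes, delta: List[Op]) -> bytes:
    """Rebuild the new backup from the old one and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def delta_payload_size(delta: List[Op]) -> int:
    """Number of literal bytes the delta has to store."""
    return sum(len(op[1]) for op in delta if op[0] == "data")
```

When two backup sets differ by only a few bytes, `delta_payload_size` is a handful of bytes even though each full set may be large, which is where high reduction ratios for repeated backups of similar data come from.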
'We are expanding our use of VMware and it was becoming a real challenge to put the images onto tape,' explains Howard Gaskin, IT infrastructure manager at Ipswich Borough Council. 'With the ExaGrid system, we are receiving 61:1 compression for our VMware images and our recovery times are far better.'
Though it did not get rid of the data completely, another user, the Royal College of Physicians, was able to use Symantec's Enterprise Vault archiving to get rid of many of the PST files created by its Microsoft Exchange email system by archiving email over a given age or file size to an off-site storage area network. That was crucial in an organisation whose 22,000 Fellows and Collegiate Members regularly send large image scans to each other using the office email system, and whose 496 email mailboxes were increasing in size by around 25 per cent every year.
'They were scattered on local hard-drives, laptops and servers, effectively taking up a significant portion of our primary storage for very little benefit,' says Christopher Venning, IT network and support manager at the Royal College of Physicians, who stops short of deleting the data completely. 'A significant proportion of the College's intellectual capital is stored in our email.'
The bad habits that are causing the waste data deluge
Bad file housekeeping. Storing out-of-date files which are no longer used clogs up the local hard disk and backup sets. Audit your personal file store every three to six months; if you find files you haven't accessed for two or more years, then you probably don't need them.
Failing to empty the recycle bin. Once you've safely archived all those pictures, music files and home videos to cheaper, removable storage media, delete them and remember to empty the recycle bin – otherwise the files will continue to congest the hard disk.
Forwarding and re-forwarding attachments with every email. Every copy of that message, and all of the duplicated attachments, are stored on a server somewhere. If you reply with history five times including the original attachment, a 4MB file takes up 20MB in your inbox and another 20MB in every inbox you sent the messages to. If you need to keep the original file, download it to your hard disk and detach the copy before you reply or forward.
Keeping emails forever. Do you really need that message sent five years ago to a colleague who has now left the company? Only keep messages that contain content which is still relevant, or consider cutting and pasting that content into a different file. Either archive or delete anything else.
The trail of digital underwear. Keeping sequential 'back up' copies on a shared volume of files you are working on, then neglecting to delete them once the primary document has been signed off.
Unnecessarily weighty files. It helps to understand the file format and reduce the file size where the option exists – such as with Adobe PDFs.
Not uninstalling applications and operating system features you never use. Some programs take up gigabytes of space and others just a few megabytes, and they can also expand your Windows profile unnecessarily. If you use them seldom or not at all, consider uninstalling them.
Performing full backups when incremental will do. Backing up the entire contents of your file folder creates multiple backup sets which get progressively larger as more data is added to them over time. Do incremental backups that record only new files or changes to those already stored in the backup, and configure the software so it overwrites existing backups when they are out of date.
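The incremental approach in the last tip can be sketched as follows, under stated assumptions: a manifest of content hashes from the previous run decides which files are new or changed, and only those are copied, overwriting any out-of-date copy in the backup. The function names and the in-memory manifest are hypothetical; real backup software would persist the manifest and also handle deletions and retention.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents (reads the whole file; fine for a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def incremental_backup(source: Path, dest: Path, manifest: Dict[str, str]) -> List[str]:
    """Copy only new or changed files from source to dest.

    `manifest` maps relative paths to the digest seen at the last backup
    and is updated in place. Returns the relative paths that were copied.
    `dest` is assumed to live outside `source`.
    """
    copied: List[str] = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source).as_posix()
        digest = file_digest(path)
        if manifest.get(rel) == digest:
            continue  # unchanged since the last run, so skip it
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # overwrites the out-of-date copy
        manifest[rel] = digest
        copied.append(rel)
    return copied
```

Run twice in a row against an unchanged folder, the second call copies nothing. After one file is edited, only that file is copied again.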