File not found: Archiving the Internet

Archivists want to preserve the Web for posterity and halt the bit-rot as untended and reinvented pages collapse into useless error messages. We look into the dim and not-so-distant past.

'The PowerBook display is a bit temperamental,' Jim Boulton, deputy managing director of digital marketing agency Story Worldwide, warns me as we study a relic of a bygone age: a computer from the mid-1990s.

Dating back to a time when Apple was heading towards the corporate graveyard and had yet to get its infusion of cool from a revamped team, the chunky, clunky grey laptop looks older than its true decade and a half of service. A company dedicated to preserving digital dinosaurs has kept the machine alive; in turn, the machine is now pretty much essential for running the software it hosts.

The Arts Council-funded Antirom was developed as a reaction to the abuse of the term 'interactive multimedia' to describe the way publishers repackaged images, sound and video on CDs, where the only interactive component was the menu used to select each file. The collective that put it together used Macromedia's Director. Like most software, Director has been through many changes since it first launched in 1988 – and is now supplied by Adobe. Those later versions won't run Antirom, so the pioneering package is now locked to a generation of hardware that has mostly shifted from the desktop to the skip.

Sitting alongside a copy of Wired magazine, published amid the enthusiasm for all things Internet and the rush to buy shares in Netscape, Antirom formed part of a temporary time capsule put on display in Shoreditch: the epicentre of London's web-development community.

The 'Page Not Found' exhibition, held briefly in mid-November last year, showcased websites from the past 15 years that were influential in their design. For a repeat later this year, the organisers hope to have a much more extensive display of important sites, ranging from the design nightmares and animated GIFs that pervaded old hosting communities such as GeoCities.com to Vodafone's Future Vision site from just seven years ago, which has a far better reputation among web designers and which appeared in last November's show.

Exhibition double-quick

'Vodafone and the Ikea Dream Kitchen are quite recent, but everybody said "wow" when they appeared,' says Boulton. 'These are sites that made everyone stop and take a deep breath.'

Boulton says the exhibition came together quickly: 'It was one of those conversations that you have over a beer.'

The problem was that it was already autumn by the time the idea was formed. To coincide with the UK's Internet Week, Boulton and colleagues had just six weeks to get everything together. 'We thought it wouldn't happen at all if we let this one go by and waited for a year to put something together,' he explains.

With 18 representative websites, it meant rescuing the guts of three old websites a week from backup CDs and dusty, unwanted computers. To recreate the feel of an Internet past, the Page Not Found organisers borrowed some old computers, including a NeXT workstation to represent the beginnings of the Web, from digital preservation specialist Binary Dinosaurs. Other old machines came from eBay, including a circa-2001 Aqua Mac G3 that had its own built-in time capsule: emails and other documents from Irene, its last careful owner.

Finding representative hardware was easier than obtaining famous websites. 'Some of it was just impossible. And some we haven't been able to get together in their entirety,' Boulton explains.

Dispensable publishing

The remaining content of some of the sites barely extends past the façade of the home page: links quickly end in 404 messages from the server. Some famous sites, such as Boo.com, which launched on a wave of hype before the collapse of the Internet bubble, seem to be long gone. 'We tried to get it, but nobody has got the code,' says Boulton. 'Technologists are not interested in looking backwards, but that's not very helpful when trying to get an archaeological site together.'

'Paper and stone are the only permanent media,' Boulton adds ruefully. 'Today, everyone can be a publisher, but that means that publishing has become dispensable.'

For decades, newspapers were no more than tomorrow's chip wrappers, but every single issue of a national newspaper published in the past couple of centuries has a place in the British Library archive. Search engines have made it harder for publishers to hide mistakes – even after an embarrassing story has been taken offline, it might sit for days in the Google cache, giving people with an axe to grind the opportunity to copy the material and store it themselves.

Yet, few Internet publishers make a serious attempt to keep older pages online. Even something as cosmetic as a site redesign can lead to the wholesale deletion of content, when the new look involves changes to the URLs or the database that holds the material. The BBC maintains older news pages, locked into older designs, but the broadcaster is in a minority.

British Library electronic archive

Helen Hockx-Yu, the British Library's web archive programme manager, is keenly aware of the difference in status of printed and electronic media. Newspapers, as well as books and magazines, sit in the library's archive, because publishers are bound by law to provide copies for free.

Hockx-Yu says the British Library has already built up a sizeable archive of electronic items. 'We don't just hold books and microfilm. We have on the order of one million digital objects – 130TB of data,' she claims.

But the library is limited in what it can store. 'We need the consent of website owners to capture their content,' says Hockx-Yu.

The Department for Culture, Media and Sport is currently consulting on new legislation that will make it feasible for the library to archive much more of the online material generated by UK publishers – whether professional or amateur.

'The .uk domain is one of the largest in the world,' says Hockx-Yu.

The .uk top-level domain had eight million addresses registered by the end of 2009, according to UK registrar Nominet, putting the nation behind the US and Germany but ahead of most other countries. On those domains, UK companies and individuals have generated a huge body of data.

'We are looking at capturing the freely available portion of the domain: some 4.3 million objects,' says Hockx-Yu.

The library will use the same techniques as those used by Bing and Google to build their indexes of the Internet. These organisations have built software spiders that crawl the network of links that connect Web pages and other online resources together. Just for the UK, it's an immense task.
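In outline, such a spider works from a queue: fetch a page, store a copy, extract its links and add any unseen ones back to the queue. The Python sketch below is a toy version of that loop, confined to a single hypothetical host and far simpler than the production crawlers the search engines or the library would run; the seed address, page limit and one-second politeness delay are illustrative assumptions, and it needs the third-party requests and beautifulsoup4 packages.

```python
# A toy illustration of how a crawl of linked pages proceeds; the seed URL,
# page limit and politeness delay are illustrative assumptions only.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=50, delay=1.0):
    """Breadth-first crawl from a seed page, staying on one host."""
    host = urlparse(seed).netloc
    queue, seen, archive = [seed], {seed}, {}

    while queue and len(archive) < max_pages:
        url = queue.pop(0)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        archive[url] = page.text          # keep the captured page text

        # Queue every same-host link we have not visited yet.
        for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)                 # be polite to the server

    return archive

pages = crawl("http://example.co.uk/")    # hypothetical seed address
print(len(pages), "pages captured")
```

Scaled up to millions of sites, with images, stylesheets and video to fetch as well as HTML, the same loop quickly becomes the terabyte-scale job Hockx-Yu describes.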

'It's 110TB of data each time we crawl. That's the scale of the problem and each time we crawl it's getting bigger,' says Hockx-Yu.

But, just as with the newspapers once seen as utterly disposable, storage space is now being set aside for electronic media so that it is not lost forever in a site upgrade or a hard-drive failure.

'We regard the preservation and long-term access to websites as a part of our national responsibility,' says Hockx-Yu. She points to the archive the British Library holds of the late MP Robin Cook's website, as well as the videos assembled for sculptor Antony Gormley's 'One & Other' project, which saw almost 2,500 people take their place for an hour each on the usually unoccupied fourth plinth in London's Trafalgar Square. The original One & Other site was deleted in spring 2010, but its content was captured by the library and can still be accessed.

Wayback Machine

The British Library is not alone in trying to hold back the digital tides that threaten to wipe older sites away without a trace. Since the late 1990s, the Internet Archive has tried to maintain a long-term presence for the core content that has appeared on, and often disappeared from, the Web. The Wayback Machine operated by the archive stores only the text of older web pages. The pictures need to be filled in by the original sites, so, if those vanish, the pages turn into unglamorous shadows of their former selves.

'The good news is that these archives exist. But the not so good news is that it's not so easy to get to them and use them. And we do not get the complete experience the way things are architected now. But the stuff that is missing may very well still exist on the Web. It may be in a different archive; it may still be on the original site,' Herbert van de Sompel of Los Alamos National Laboratory told computer scientists at a seminar in the US last year.

Van de Sompel wants to make it possible for future historians to surf from site to archived site as though they had travelled back in time with pictures, sound and video. But that is difficult with content that is moved into separately maintained archives.

'It's fun to look into the past and with the Memento project we aim to make it easy,' says van de Sompel. 'With the way that the architecture of the Web is defined, at any moment in time you can only get to the current representation of a resource. Today, you get a page. Tomorrow, you get a new page and the old one has gone entirely. The old representations – the one from yesterday, the day before, from a year ago – they are gone forever.'

Some sites do maintain a record of how they looked in the past. The most recognisable is Wikipedia, which keeps a complete log of changes for its encyclopaedia pages so that editors can easily revert to older versions – a handy weapon against vandalism. The problem with Wikipedia, says van de Sompel, is that although you can see how a page has changed, browsing its history is nowhere near as easy as surfing to the most recent version.
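By way of illustration, Wikipedia's revision log is exposed through the public MediaWiki API, so a page's recent history can be listed with a single request. The short Python sketch below does just that; the article title and the number of revisions requested are arbitrary examples.

```python
# Listing recent revisions of a Wikipedia article via the MediaWiki API;
# the article title and revision count are arbitrary examples.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "Internet Archive",
        "rvlimit": 5,                          # five most recent revisions
        "rvprop": "ids|timestamp|user|comment",
        "format": "json",
    },
)

page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev.get("user", "?"), rev.get("comment", ""))
```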

The plan with Memento is to give the server that hosts pages a sense of time, taking advantage of a little-known part of the specification for the Hypertext Transfer Protocol: the one that makes it possible for a server to send Flash content to regular PCs but serve up plain HTML to an iPad or iPhone, which cannot run the Flash plug-in. Using this mechanism, called content negotiation, a browser can ask for a page from a specific time and the server, if it understands the request, can pull the copy from the archive with the closest time-stamp.
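In practice, a Memento request is an ordinary HTTP fetch carrying an extra Accept-Datetime header, sent to a 'TimeGate' that knows where the archived copies live. The Python sketch below assumes the Internet Archive's public Wayback Machine TimeGate at web.archive.org behaves this way and uses the British Library's homepage as an arbitrary target; it is a minimal illustration of the idea, not part of any archive's official toolkit.

```python
# A minimal sketch of Memento-style datetime negotiation (RFC 7089);
# the TimeGate endpoint and target page below are assumptions for illustration.
import requests

target = "http://www.bl.uk/"                      # page we want an old copy of
timegate = "https://web.archive.org/web/" + target

# Accept-Datetime asks the TimeGate for the snapshot closest to this moment.
headers = {"Accept-Datetime": "Thu, 01 Jan 2004 00:00:00 GMT"}

resp = requests.get(timegate, headers=headers, allow_redirects=True)

print(resp.url)                                   # URL of the archived copy served
print(resp.headers.get("Memento-Datetime"))       # when that copy was captured
```

The same request made without the Accept-Datetime header simply returns the current page, which is exactly the limitation of today's Web architecture that van de Sompel describes.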

'It is not a monolithic kind of archive but distributed as the web is,' explains van de Sompel.

The MediaWiki group, which provides the core software for Wikipedia and other wiki sites, has a plug-in that makes it possible to surf its sites using the Memento system. And users of Firefox can install a browser plug-in to navigate other sites through Memento.

Hockx-Yu says the Memento project shows promise for the future: 'Hopefully, when you use your browser in the future, we will be able to slide back in time and see the British Library in 1997, for example.'
