Ideally, we'd preserve the World Wide Web and all its links in one gigantic, living database. But what if we could only pick ten sites? Which would they be and why?
1. The first Web page
What better website to archive than the very first – the pages that Tim Berners-Lee put together at CERN? But do you archive how they were meant to look or how most of the world saw them?
Berners-Lee used a NeXT workstation to write the code for the first working webserver and browser. The machine could cope more readily with graphics than most of the computers that would later connect to the Web, whose users would mostly see a text-only version.
As hardly anybody outside Switzerland saw the original site, few realised that the first HTML-based pages supported more than raw text, a factor that led to the creation of stylesheets independently of the techniques that Berners-Lee used.
2. Wikipedia
To a large extent, Wikipedia is self-preserving thanks to the log of edits that its underlying MediaWiki software provides – you can always rewind to an older version of a page to see what it contained if not exactly how it looked. However, that's not entirely true.
The Wikipedia editors delete a lot of pages, either because they contain too much promotion or because their subjects just aren't famous enough. These do not make it into the Wikipedia archive.
The same oblivion can await some of the chatter from the 'Talk' pages, where editors discuss, and often bicker over, what appears on the main page.
Luckily, for anyone interested in lists of musical instruments that appear in computer game 'The Legend of Zelda', this important information has been hoovered up by the Deletionpedia maintained by Essex-based programmer and occasional Wikipedia contributor David Batley, who started the site in 2008.
3. Google
What form of site on the Web has been more important over the past 20 years than a search engine? And for at least ten of those years, the pre-eminent search engine has been Google. But how do you archive something that changes on a day-by-day basis? It's not just the Web indexed by Google's crawlers that changes, but Google itself, as it frequently changes how it orders pages in a game of cat-and-mouse with spammers.
The company preserves a breadcrumb trail of clicked links and searches made by logged-in users. But maintaining a history of its indexes is something else altogether, and a far larger job. Once you factor in add-on sites such as video archive YouTube, you begin to wonder just how many servers it would take to preserve Google's content for posterity.
4. Facebook
Facebook is a relative newcomer, but with hundreds of millions of people now signed up worldwide, the site will surely be a magnet for future historians. However, the potentially highly personal information on the site will raise ethical questions for anyone who plans to preserve the data. For example, what do you do with the pages of people who have decided to quit Facebook? Should you keep them in an archive? Like data from the regular ten-year census, all the personal information might have to be stripped out or anonymised before the archive could be released.
5. Drudge Report
Few websites run by essentially one person have had the impact of Drudge Report. The right-leaning news aggregation site began as an email newsletter sent out to friends, then posted on a Usenet forum. As it slowly turned from an entertainment gossip outlet into a political one and began to carry links to news stories, the Drudge Report headed onto the Web.
Its formula is more or less as it was in the mid-1990s when the report shot to fame by reporting first that Jack Kemp would be Bob Dole's presidential running mate against Bill Clinton. However, Drudge is most famous for being the first to report on rumours about Clinton's infamous relationship with White House intern Monica Lewinsky.
Because of the way that it links to news stories on other sites, the Drudge Report has acquired a reputation as one of the first blogs: the headline link is one of the familiar formats of short-form weblogs.
An archive of the report exists but only goes back to November 2001 in its current form.
6. 4Chan
In the decades to come, historians will wonder what mattered to users of the Web. It's not news for nerds, however. It's cat pictures. What could be more adorable than a kitten popping its head out of a box of paper tissues? Many Internet users will have seen this explosion of ephemera on sites such as Cute Overload and I Can Has Cheezburger.
But the ultimate source of the weird meme is the far darker website 4Chan. Its users originally came up with the concept of 'Caturday' and the 'lolcat' lingo and a bunch of other fast-spreading Internet memes. It takes a strong stomach to surf through parts of 4Chan but no single site demonstrates the id of the Internet better.
7. Boo.com
Boo.com is a lost cause as far as preservation goes – and yet one of the best candidates as no other site represents so well the 'Burn Rate' days of the Internet. There were other famous and bigger set-ups that burned through their shareholder capital, such as Pets.com, Value America and Webvan. Boo.com, though, was the first to fall so hard – just weeks after actually starting trading – and largely on the basis of its site rather than an inability to deliver product, which afflicted many of the others.
The bad news for website preservers is that the site is long gone. A threatened relaunch in 2007 seems to have come to nothing, as the address now redirects to a listing of hostels.
8. Slashdot
News for nerds. Stuff that matters. That's Slashdot, as long as what matters largely relates to computers, with a smidgen of legal rights online. Like Drudge Report, it's mostly links to stories that appear elsewhere, but attached to an often vociferous user forum where participants get to affect reputations by 'modding' posts up, down or just marking them as funny.
A clue as to the composition of the Web's audience lay in Slashdot's ability to turn a webserver into a smoking heap by simply linking to an item on it. 'Slashdotting' a site is today not quite as intense as it used to be but the site still attracts massive traffic and remains one of the main sources of information for the technologically curious part of the online audience.
9. 2Advanced
There is no 'skip intro' on 2Advanced Studio's site. You have to wait for the code to load to get anywhere but, as the portfolio site for a design agency, this one is all about the Flash animations it contains. The 2001 version of the site topped a poll for the most influential Flash site of the decade.
Unlike all the other sites on the preservation list, the content is not important, but the way it's presented is. The visual style has spread throughout the Web, so much so that over-familiarity with the design elements means later versions have never had the same impact. Luckily for design-history specialists, the company has maintained an archive of its former sites so they can see the evolution of form over function in online design.
10. The Open Directory Project
Before search engines, there was the directory. The best guide to the Web was printed in the form of the Whole Earth Directory. The concept quickly migrated online, providing the basis for Yahoo's original format and for the Open Directory Project (ODP), originally titled Gnuhoo. Were we able to surf older versions, we would have an excellent picture of the way the Web evolved in its earlier days.