My Way: British Library
Digitising the British Library's books is a daunting task. E&T meets the man who's managing the project.
E&T: What are the origins of this project?
Neil Fitzgerald: The library is engaged in various digitisation projects across different types of material, working with commercial partners in many areas. There were discussions back in 2004/2005 with commercial parties interested in large-scale digitisation, and the library felt Microsoft was the most appropriate partner to proceed with on this project.
E&T: Why so?
NF: Microsoft was obviously interested in structured information which improves search and user experience on the 'Live Search' website, and for us it's really helping to fulfil some of our strategic priorities, such as building the digital research environments, and the digitisation of the collection for resource discovery purposes.
E&T: What does the project entail?
NF: The parameters are to digitise up to 25 million pages of printed books, up to 1900 AD, which is the copyright cut-off date. Initially the selection will be based on English literature, but we'll broaden the subject scope. It will be keeping me busy for a few years, I think.
E&T: You need different approaches to different types of material, presumably – not just printed materials?
NF: The National Sound Archive, for example, is based within the British Library as well, so you obviously need a different approach when you're digitising sounds than you would with books, or newspapers. You need a specific technical solution, and the user interface has to be different to reflect the type of material that you're working with.
E&T: So what kinds of solutions are you using?
NF: Mass digitisation on this scale is a relatively new area. The Library's been involved in digitisation projects for possibly 10-15 years now, but specifically in digitising the iconic, unique items like manuscripts. Most of our experience has been gained in this area. You can take some of the lessons learned in that process and apply them to mass digitisation, but some of the techniques and approaches just aren't scalable.
E&T: How do you mean?
NF: It wouldn't be appropriate to apply the approach you would use for digitising one or two items to 100,000 items. It's been a combination of using past experience and working with Microsoft and their digitisation contractor to come up with new solutions for some of the issues that mass digitisation presents.
E&T: How does the digitisation process work?
NF: We used to use a fully manual process for digitising, so we would have a traditional photographic approach and produce extremely high-quality files which look very good, but would take a long time to actually produce. With the volumes we're dealing with now, we obviously needed more automation, producing files at a much quicker rate. We're now using what we consider to be a semi-automated approach, using scanners with robotic arms.
E&T: Have you had any issue with handling the old volumes and 'merging' books with technology?
NF: Historical material is not standard in any way, shape or form, unlike modern books. With modern materials you could put the book on the machine and allow it to run in fully automated mode. Historical material has fold-outs, pages of different sizes, different paper types and different binding styles. So there are various complications in the process, but the methods we're using allow us to scan 50,000 pages a day, or about a million pages a month; so it's quite a significant volume that we're producing on a daily basis.
E&T: What are the challenges with the optical character recognition (OCR) software?
NF: Printing wasn't standardised until the late 1800s, so different typefaces, fonts and printing styles were used in material up until that date. Some of those are more challenging for the software to deal with than others, so the OCR software can struggle. One of the ways we mitigate that is to embed the OCR text into the page image. So you search the full text, and it actually takes you to the page image. Therefore, if the OCR isn't 100 per cent correct, you can read the actual page rather than the embedded text.
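The idea described here can be sketched in a few lines. This is a minimal illustration with invented data and function names, not the Library's actual system: full-text search runs over the OCR text, but a match resolves to the page image, so imperfect OCR never has to be shown to the reader directly.

```python
# Hypothetical index: OCR text keyed by page-image filename.
ocr_index = {
    "vol1_page_001.jp2": "It was the best of times, it was the worst of times",
    "vol1_page_002.jp2": "a tale of two cities by charles dickens",
}

def search_pages(query):
    """Return the page images whose embedded OCR text matches the query.

    The OCR layer is used only for matching; the deliverable is the
    scanned image itself, so OCR errors stay invisible to the reader.
    """
    q = query.lower()
    return [image for image, text in ocr_index.items() if q in text.lower()]

print(search_pages("two cities"))  # -> ['vol1_page_002.jp2']
```

In practice this "text-under-image" pairing is what searchable-PDF and similar formats provide; the sketch only shows why OCR accuracy matters less when the image, not the text, is what gets delivered.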
E&T: What about recognition of specific typefaces?
NF: There are several OCR companies out there working on [this]. They can deliver custom modules which allow the OCR software to handle these typefaces much better. In future, if we come across a whole section of books with a particularly challenging typeface, and there's enough of a critical mass to make it worthwhile, we will work with other parties to develop an improved solution. It's in the library's interests and the commercial sector's interests to deliver the tools needed to digitise this material.
E&T: And how are you digitally storing these?
NF: They're digital images of the pages, which are sent to a processing cluster, which churns through the data and produces a set of deliverables. Within that set of deliverables there will be a master file which we will keep for long-term digital preservation purposes, and then there will be an access file which we will use to provide the interaction with the digital surrogates.
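The deliverable set described here can be sketched as follows. All names and formats are invented for illustration: each scanned page yields a lossless master kept for long-term preservation and a smaller access file used to serve the digital surrogate.

```python
def make_deliverables(page_id, raw_scan_bytes):
    """Produce the per-page deliverable set (hypothetical formats)."""
    return {
        "master": {  # kept untouched for long-term digital preservation
            "filename": f"{page_id}_master.tiff",
            "data": raw_scan_bytes,  # full-resolution, lossless
        },
        "access": {  # smaller derivative used to deliver the surrogate
            "filename": f"{page_id}_access.jpg",
            # Stand-in for real compression: keep a tenth of the bytes.
            "data": raw_scan_bytes[: len(raw_scan_bytes) // 10],
        },
    }

d = make_deliverables("vol1_page_001", b"\x00" * 1000)
print(d["master"]["filename"], d["access"]["filename"])
```

The design point is the separation of concerns: the master is never reprocessed for delivery, so access formats can be regenerated later without touching the preservation copy.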
E&T: What happens to that?
NF: We'll link that file to our catalogue, so we have an online catalogue where you can search via different terms: keywords, or the name of the author, or the title. We will then link the digital surrogate image files to the catalogue records, so you'll be able to click a button next to the appropriate catalogue entry, and it will deliver the digital surrogate to your desktop. This part of the project is under development, but we expect to have it up and running in the first half of this year.
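The linking step described here can be sketched simply. All record data below is invented: each catalogue record optionally carries the location of its digital surrogate, so a search hit for a digitised book can offer a one-click "view" action, while undigitised records appear unchanged.

```python
# Hypothetical catalogue records; "surrogate" holds the location of the
# access files produced by digitisation, or None if not yet digitised.
catalogue = [
    {"id": "BL001", "title": "Bleak House", "author": "Dickens, Charles",
     "surrogate": "surrogates/BL001/"},
    {"id": "BL002", "title": "Middlemarch", "author": "Eliot, George",
     "surrogate": None},
]

def find(term):
    """Search the catalogue by title or author keyword."""
    t = term.lower()
    return [r for r in catalogue
            if t in r["title"].lower() or t in r["author"].lower()]

for rec in find("dickens"):
    if rec["surrogate"]:  # only digitised items get the "view" button
        print(f'{rec["title"]}: view digital copy at {rec["surrogate"]}')
```

This mirrors the approach of extending the existing catalogue rather than building a parallel system: the surrogate link is just an extra field on records users already know how to find.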
E&T: Is this a bespoke system being created for that one purpose?
NF: We're using existing systems. The Integrated Library System (ILS) is the catalogue system that everyone who uses our collections has access to. We're just adding functionality to the existing system, as there's no point reinventing the wheel if we have a method everyone is familiar with. Looking to the future, there's quite a lot of talk within the Library and the information world about Web 2.0 or Library 2.0 tools, so we may add layers of functionality on top of that, but not yet.