A web of research

Text mining and Web 2.0 techniques could provide a system for sifting mountains of biological data, providing new possibilities for collaboration. E&T reports.

Alfonso Valencia of the Spanish National Cancer Research Centre was surprised at the turnout for his talk at the Genomes to Systems conference held in Manchester last year. "We are beating metagenomics," he boasted, proud that a session on IT was scoring bigger numbers than one of the hot topics in biological research right now.

When you look at the mountain of research that currently looms in front of bioscientists, it is easy to see the attraction of what Valencia and others are working on: a system to pick the bones out of every paper published.

So-called high-throughput techniques, such as large-scale genome sequencing and lab-on-a-chip experiments, have made it possible to produce masses of data. But eminent biologists such as Nobel laureate Sydney Brenner complain of a dearth of insight and theory in the science amid the push to publish more and more data.

"The number of published papers is enormous, and increasing all the time," says Jun'ichi Tsujii of the University of Tokyo, and the University of Manchester's National Centre for Text Mining (NACTEM).

Information Hyperlinked Over Proteins (iHop), developed by Robert Hoffmann at the Memorial Sloan Kettering Cancer Center in New York, is one step on the way. The Web-based system scours papers listed by the vast PubMed database of biology literature for references to proteins, and puts them into its own database of links. Researchers looking for information about a protein or the gene that contains its recipe can then search the database to find all mentions of it and - perhaps more importantly - data on genes and proteins that appear to be related to it.
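The idea of a database of links can be illustrated with a toy co-mention index. The sketch below is not iHop's actual implementation: the abstracts, the protein names (ProtA, ProtB, ProtC) and the function build_index are all hypothetical, and real systems need far more robust name recognition than the simple word matching used here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data standing in for abstracts; not real PubMed records.
ABSTRACTS = {
    "paper-1": "ProtA binds ProtB during cell division.",
    "paper-2": "ProtB and ProtC are both regulated by stress.",
    "paper-3": "ProtA is expressed in the liver.",
}

# Hypothetical lexicon of protein names to look for.
PROTEINS = {"ProtA", "ProtB", "ProtC"}


def build_index(abstracts, proteins):
    """Return (mentions, co_mentions) built by exact word matching."""
    mentions = defaultdict(set)     # protein -> IDs of papers that mention it
    co_mentions = defaultdict(set)  # protein -> proteins named in the same paper
    for paper_id, text in abstracts.items():
        # Split on whitespace and strip surrounding punctuation.
        tokens = {tok.strip(".,;()") for tok in text.split()}
        found = sorted(tokens & proteins)
        for name in found:
            mentions[name].add(paper_id)
        # Every pair of proteins in one abstract becomes a two-way link.
        for a, b in combinations(found, 2):
            co_mentions[a].add(b)
            co_mentions[b].add(a)
    return mentions, co_mentions


mentions, co_mentions = build_index(ABSTRACTS, PROTEINS)
```

Looking up a protein in mentions gives every paper that names it, while co_mentions gives the candidates for related proteins. Co-occurrence in one abstract is only a hint of a relationship, which is why such a tool serves as an entry point rather than an answer.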

"If I am working on a new problem in biology, then this provides an entry point," Valencia finds. "It does not solve all the problems of biology, but it is an interesting starting point."

Biology maps and crowdsourcing

Hiroaki Kitano, director of the Sony Computer Science Laboratories in Japan and president of the Systems Biology Institute, wants to go much further and make the literature part of an active, continually updated model of biology. His dream is of a 'Google Maps'-like tool for the processes that take place inside living cells. Kitano points to the electronics industry as the model for the system, which is called Payao. "There would be no electronics industry without circuit diagrams," he believes.

Biologists already have their own kinds of circuit diagrams. These are the pathway models that describe how enzymes, chemicals and genes interact as food is digested or the cells grow and divide. The pathway diagrams are enormous, complex and, unfortunately, ambiguous.

The Systems Biology Graphical Notation (SBGN) effort, which Kitano enthusiastically promotes, is an attempt to standardise the symbols used in pathway diagrams. Once the diagrams are in an agreed format, it should become possible to use them as a way of helping biologists navigate around the voluminous literature.

Kitano wants the diagrams to contain links to all of the papers that relate to each section. If a paper covers a reaction between a chemical, such as glucose, and an enzyme, it should pop up when the cursor hovers over the symbol for that reaction.

To kick-start the process, NACTEM is building a text-mining system to relate the papers to some of the most heavily covered pathways.

Tsujii says: "Our aim is to link the published work with the pathway, where information on the pathways and in published papers are integrated in a coordinated system. Pathways are seen as important work for integrating biological knowledge into coherent interpretations. Unfortunately, the pathways are not well linked to other resources of information."

Ultimately, the people behind Payao hope it will become a 'crowdsourcing' project where users keep it up-to-date as new papers come out.

Crowdsourcing is a neologism for the act of taking a task traditionally performed by an employee or contractor, and outsourcing it to an undefined, generally large group of people or community in the form of an open call. In the context of the NACTEM project, it may overcome one of the big problems of text mining.

Valencia says: "To find good interactions is easy; to find all of them is very hard. Most of the time, we're not really aware of the biological complexity of the project, such as finding the right name for the species."

Tsujii agrees, explaining that some 600 papers were mined to build the database at NACTEM for one biological pathway that contained 650 nodes and 440 links - "But you don't know if they are all the papers. And, importantly, new papers are published all the time and may mean that the pathway has to be revised accordingly. Pathway maintenance has to be considered carefully."

Spindle proteins

Hoffmann and Valencia's work on text mining in the past couple of years has focused on one narrow area. They aimed to find references to spindle proteins, which are made by cells just as they are about to divide. The spindle proteins play a role, albeit uncertain, in keeping the two halves of the cell segregated as it prepares for the unzipping process that will create two cells out of one.

By concentrating on one apparently narrow area, the researchers could find out where the pitfalls in text mining lie. Even the comparatively structured world of the journal paper quickly turns out to be quite imprecise.

"Many of the papers catalogued were papers about mice. Biologists often pretend that the paper isn't about the mouse, because they think the work is relevant to humans. So, they don't mention that the paper is about a mouse. This is where text mining can make a mistake because the system doesn't realise it is a mouse paper," explains Valencia.

It gets worse, says Hirschmann. Even the terms change meaning, get confused with other words or split into synonyms. She explains that another term in the literature for esterase 6, the shorthand for an enzyme called carboxylic-ester hydrolase-6, is 'EST-5'. And there is a slightly different protein called esterase 5, but that gets shortened to 'Es-5'. It's hard enough for human beings to keep up with those subtleties, let alone an algorithm.

Hirschmann says there are problems with such rapidly evolving areas as the biosciences, particularly when it comes to developing ontologies to describe the many technical terms these papers use. "There is concern that an ontology built for 2008 won't be good for 2009," she says. "We need to figure out a way for these structures to modify over time. It is a very interesting problem."

To deal with the problem for their work on spindle proteins, Valencia and colleagues built a positive and negative set of key references. "We built a text-mining gold standard for spindle proteins," says Valencia. With the negative set to refer to, the software can strip out references that look connected but are merely coincidental.
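How a negative set helps can be sketched with a simple vocabulary comparison. This is an illustration only, not the method Valencia's group used: the example sentences, the function names and the word-overlap scoring are all assumptions. It relies on the fact that 'spindle' also turns up in unrelated fields, as in sleep spindles and muscle spindles.

```python
# Hypothetical gold-standard examples; not taken from the real reference sets.
POSITIVE = [
    "spindle assembly during mitosis requires microtubule motor proteins",
    "kinetochore attachment to the mitotic spindle in dividing cells",
]
NEGATIVE = [
    "sleep spindle density in EEG recordings of adults",
    "muscle spindle afferents respond to stretch",
]


def vocabulary(texts):
    """Collect the lower-cased words used across a list of texts."""
    return {word for text in texts for word in text.lower().split()}


def keep(candidate, positive, negative):
    """Keep a candidate only if it looks more like the positive set.

    Words that occur in both sets (such as 'spindle') carry no signal,
    so only words exclusive to one set are counted.
    """
    words = set(candidate.lower().split())
    pos_only = vocabulary(positive) - vocabulary(negative)
    neg_only = vocabulary(negative) - vocabulary(positive)
    return len(words & pos_only) > len(words & neg_only)


relevant = keep("microtubule dynamics at the mitotic spindle", POSITIVE, NEGATIVE)
coincidental = keep("sleep spindle changes in adults", POSITIVE, NEGATIVE)
```

Both candidates contain the word 'spindle', but only the first shares vocabulary with the positive examples, so the second is stripped out. A production system would use a trained classifier rather than raw word counts.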

Crowdsourcing could help fix mistakes made by the machines. But the approach advocated by Kitano has worried some researchers. If the system is too open, it could attract vandals who mess up the connections and link to irrelevant or misleading papers. And if the system is too regimented, it may miss important leading-edge research.

Sobia Raza of the University of Edinburgh points out that the literature, particularly in the early development of a field, is contradictory. Which paper takes precedence? And do you wait until at least two papers have been published on a subject before deciding that a piece of the diagram has to change?

The move towards text-mined knowledge bases could change the way that scientists publish their work, Valencia claims. Authors could tag their abstracts with metadata describing the work, which the computer could then use to classify it more accurately. Hirschmann can see problems with this: "It could produce very uneven annotation quality and could actually increase workload overall," she worries. "It is a question to put to the authors and the journals. And the journals are lukewarm about it."

The information need not just be words and images. Chris Sander, head of the computational biology centre at the Memorial Sloan Kettering Cancer Center, sees the iHop system and another search engine called Pathway-Commons, which "aspires to be the Google of pathways", as hosts for raw data that other scientists can tap into.

"We want to capture not just facts from papers but directly from the authors the data they used. It has been done with other databases and we are doing prototype software for that. It will happen in the same way that open-access publishing has happened. It will help make biological knowledge computable," claims Sander, who aims to build a genome atlas of cancer: a tool that could determine the best treatment for a patient based on their genetic makeup.

By attaching different sets of experimental data to the diagrams, the researchers envision being able to perform meta-analyses in the computer to help make predictions and provide clues as to what new experiments are needed to understand a problem. Ultimately, they will be able to assemble detailed computer models of entire pathways that could drive insights into diseases and their treatments.
