The roots of search
Web users now depend on search to navigate their way to information, be it mission-critical market data or family history. But the mechanics of search technology pre-date the advent of the Internet.
Seek… and shall ye find?
Search might seem to be one of the key information technologies of the future, but it has its roots firmly in the past. Even the most advanced search engines owe a lot to the probability theory developed by the Reverend Thomas Bayes in the 18th century. And the key techniques employed in many of the search engines running today borrow much from work on the first successful search software, built more than 30 years ago.
At the core of most search engines is the concept of word frequency: words that pop up more often in a document tend to score more highly in rankings when a user searches for them. However, some common words are found frequently across many documents, and words like those are not very useful in finding the document you want.
So, the approach developed by Karen Spärck Jones and Stephen Robertson in the 1970s was to eliminate these words from the running and concentrate on words that are not commonly found. By combining term frequency with inverse document frequency, as the two measures are known, you can calculate the importance of a word to a document.
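A minimal sketch of the idea in Python (the smoothing and function names are illustrative, not Spärck Jones and Robertson's original formulation): a word's score is its frequency within a document, discounted by how many documents in the collection contain it.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score a term's importance to one document relative to a corpus.

    doc is a list of words; corpus is a list of such documents.
    """
    tf = Counter(doc)[term] / len(doc)            # term frequency within the document
    df = sum(1 for d in corpus if term in d)      # how many documents contain the term
    idf = math.log(len(corpus) / (1 + df))        # inverse document frequency (smoothed)
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks rallied on market data".split(),
]
# 'the' appears in most documents, so it scores nothing;
# 'market' is rare in the corpus, so it scores higher
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("market", corpus[2], corpus))
```

A common word such as 'the' is effectively eliminated from the running, while a distinctive word such as 'market' survives with a positive score.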
More advanced techniques, approaches that Autonomy chief Mike Lynch calls 'keyword plus', refine the calculation based on where the text lies. If the keywords are all in the title, that document could be given a much higher rank than one in which the words are scattered throughout a large document.
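One way such field weighting might work, sketched here with a purely hypothetical boost factor (this is not Autonomy's actual method), is to multiply a document's base score when query terms land in the title:

```python
TITLE_BOOST = 5.0  # illustrative weight, not a figure from any real engine

def field_weighted_score(query_terms, title, base_score):
    """Boost a document's base relevance score when query terms appear in its title."""
    hits = sum(1 for t in query_terms if t in title.lower().split())
    return base_score * (1 + TITLE_BOOST * hits / len(query_terms))

# A document whose title contains the query term outranks one scored on body text alone
print(field_weighted_score(["market"], "Market Data Report", 1.0))
print(field_weighted_score(["market"], "Annual Minutes", 1.0))
```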
Large documents can present near-pathological cases for search engines. The worst, reckons Charlie Hull of Lemur Consulting, are the enormous reports put together by councils. One way around that is to break the document down into sections or pages and then compute the relative term frequencies for those.
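The chunking approach can be sketched as follows (the fixed window size is an illustrative choice; real engines may split on section or page boundaries instead):

```python
def passage_scores(words, term, window=200):
    """Split a long document into fixed-size passages and score each
    passage by the term's relative frequency within it."""
    passages = [words[i:i + window] for i in range(0, len(words), window)]
    return [p.count(term) / len(p) for p in passages]

# In a toy 'report', the term is concentrated in the first passage;
# scoring passages separately stops the rest of the document diluting it
doc = ["budget"] * 3 + ["filler"] * 3
print(passage_scores(doc, "budget", window=3))
```

The document's overall rank could then be driven by its best-scoring passage rather than an average across the whole report.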
Another problem stems from the optimisations that search engine builders use to keep the size of the index under control. There are words that you are going to find in all documents: they are useful to a person reading but no more than linguistic glue to the search engine. As a result, simple verbs and conjunctions become 'stop words' and are ignored by the indexer. This, unfortunately, has the effect of taking out phrases that might be important just because they are passages made up of stop words: 'to be or not to be', for example. The question is whether you would ever find it. "There are always pathological cases," admits Hull. "The job is to identify them."
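A toy indexer (the stop-word list here is a tiny illustrative sample) makes the failure mode concrete: the famous phrase vanishes entirely before it ever reaches the index.

```python
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "and", "of"}  # illustrative subset

def index_terms(text):
    """Keep only the words an indexer would store, dropping stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Every word in the phrase is a stop word, so nothing survives to be indexed
print(index_terms("to be or not to be"))
```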
More recent approaches in search-engine design have computed vectors to try to better describe passages. Lynch reckons these do not work well: "Words modify the meaning of other words: something that is a linear vector doesn't work that well because it misses key points."
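A toy bag-of-words comparison (illustrative code, not any vendor's implementation) shows the weakness Lynch describes: a linear vector of word counts discards word order, so two sentences with opposite meanings can look identical.

```python
import math
from collections import Counter

def cosine_similarity(a_words, b_words):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Same words, opposite meaning: the vector model cannot tell them apart
s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()
print(cosine_similarity(s1, s2))
```

Because the count vectors for both sentences are identical, the similarity is 1 (up to floating-point error): precisely the "key points" a linear vector misses.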
Microsoft's Live Search engine combines a number of ranking techniques using a neural network, which is meant to learn how the techniques can best be blended to deliver what users want to find. There are other machine-learning approaches now in use. Some search-engine builders, such as Autonomy, have gone back to probabilistic techniques, but in place of raw frequency have used algorithms that allow the machine to learn how text is structured statistically. These engines use Bayesian inference to process the text: computing the probability of one word following another.
"Building in linguistics can be important, but Bayesian inference does that, because it starts to understand that words modify words," claims Lynch. The problem with Bayesian inference is that the amount of data you have to store explodes even with relatively small documents. However, by squaring terms, the influence of words on other words in computing the probabilities needed by the indexer dies away rapidly.
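The word-follows-word statistic can be sketched as a simple bigram model (an illustrative reconstruction, not Autonomy's engine). It also makes the storage problem visible: a vocabulary of V words admits up to V-squared distinct pairs, which is why the data explodes.

```python
from collections import Counter, defaultdict

def train_bigrams(words):
    """Estimate P(next word | current word) from raw pair counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        pair_counts[prev][nxt] += 1
    return {w: {nxt: c / sum(cnts.values()) for nxt, c in cnts.items()}
            for w, cnts in pair_counts.items()}

model = train_bigrams("the market rose and the market fell".split())
# In this toy corpus 'market' always follows 'the', so the probability is 1
print(model["the"])
```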
One alternative to probability is to use natural language processing, or to build ontologies of terms and their meanings, to give the search engine a better grasp of what it is trying to index.
"It's an interesting hypothesis that emerges quite frequently that statistics has run out of steam - to an extraordinary degree it has been proved wrong when put up," Robertson states. "For many information retrieval tasks you can get away without any natural language processing or ontology but you can't get away from having good statistics.
"It's like Moore's Law," he continues. "It has to run out of steam sometime, but it just hasn't happened yet. Rather than statistics running out of steam we are seeing more statistical models addressing more interesting questions that incorporate other ideas."