Weaknesses in Google Flu Trends' predictive power carry important lessons for Big Data researchers, according to a new study.
GFT was created by Google in 2008 to estimate flu activity around the world in near real-time using aggregated Google search data and now covers more than 25 countries. According to network scientist David Lazer, a professor of political science and computer science at Northeastern University, USA, it is often held up as an exemplary use of Big Data.
But in February 2013, the web service made headlines after the journal Nature reported that it was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) reported by the US Centers for Disease Control and Prevention (CDC).
According to Lazer, the case highlights two important lessons for researchers taking advantage of the huge amounts of data on social and human behaviour generated by Internet services such as search engines and social media sites – that Big Data is not a substitute for traditional statistics and that these services are not designed to produce data for scientific analysis.
“This is a particularly pertinent example, but I think that oftentimes there may be too much uncritical acceptance of what pops out of Big Data without thinking about how the data is generated,” he said, adding the caveat that he is a keen proponent of Big Data.
In a report due to appear in the journal Science tomorrow, Lazer and his colleagues found that even a fairly simple forward projection of three-week-old CDC data, which is based on surveillance reports from laboratories across the US, did a better job of estimating current flu prevalence than GFT.
But they found they could substantially improve on the predictive power of either GFT or the CDC data alone by combining the GFT data with the most up-to-date CDC data available, which normally arrives with a two-week lag.
Graph showing GFT's overestimation of flu levels compared to CDC data and a combination of the two
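As a rough illustration of what combining the two sources can mean, the sketch below regresses the current CDC figure on the current GFT estimate and the CDC figure from two weeks earlier. It is not the model used in the study: the function names (`fit_combined`, `solve_linear`, `mse`), the synthetic data and the in-sample comparison are all assumptions made for the example.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_combined(gft, cdc, lag=2):
    """Least-squares fit of cdc[t] ~ b0 + b1 * gft[t] + b2 * cdc[t - lag]."""
    rows = [[1.0, gft[t], cdc[t - lag]] for t in range(lag, len(cdc))]
    y = [cdc[t] for t in range(lag, len(cdc))]
    k = 3
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def mse(pred, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


# Synthetic data (an assumption, not real surveillance figures):
# a seasonal "CDC" series and a "GFT" series that overestimates it.
random.seed(0)
weeks, lag = 104, 2
cdc = [2.0 + 1.5 * math.sin(2 * math.pi * t / 52) + random.gauss(0, 0.1)
       for t in range(weeks)]
gft = [1.8 * c + random.gauss(0, 0.3) for c in cdc]

coef = fit_combined(gft, cdc, lag)
actual = cdc[lag:]
combined = [coef[0] + coef[1] * gft[t] + coef[2] * cdc[t - lag]
            for t in range(lag, weeks)]

mse_gft = mse(gft[lag:], actual)        # GFT estimate used as-is
mse_lagged = mse(cdc[:-lag], actual)    # two-week-old CDC value carried forward
mse_combined = mse(combined, actual)    # regression on both inputs
```

Because using GFT as-is and carrying the lagged CDC value forward are both special cases of this regression, the combined fit cannot do worse than either on the data it was fitted to. A fair comparison would fit on past weeks only and score on later ones.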
According to the study, incorporating the CDC data could have kept GFT out of the headlines, but it is still no substitute for ongoing evaluation and improvement of the system.
The authors blame the service’s growing inaccuracy over time on a failure to take into account algorithm dynamics affecting Google’s search facility – changes made by engineers to improve the service in line with its business model and changes to the way consumers use it.
So-called ‘blue team’ dynamics are the result of modifications to algorithms that alter the data returned, while ‘red team’ dynamics see users manipulating the data-generating process to meet aims such as spreading rumours about stock prices or ensuring news about their product is trending.
But Google is not the only service susceptible to this problem, according to the authors. Platforms such as Twitter and Facebook are also constantly being re-engineered, which exposes them to the same dynamics and raises some general lessons for researchers using the data they create.
“It’s not a scientific instrument that’s meant to collect information about the state of the world,” said Lazer. “The challenge then, particularly for Big Data being collected for other purposes, is that it is being repurposed for something it’s not designed for. This repurposing issue applies to both scientific purposes and corporate purposes.”
While greater transparency from the services generating the data would be helpful, companies like Google are understandably reluctant to give away the secrets behind their algorithms, and Lazer thinks that even if they were willing, they would not be able to shed much light on the issue.
“I think even the companies can’t fully understand what’s going on in their own algorithms because they are too massive and complex,” he said. “What you have to do as a scientist is build in an assumption that models will drift in ways that you can’t anticipate.”
According to the paper, to better understand how these changes occur, scientists need to replicate findings using these data sources across time and compare their findings with other, more traditional data sources.
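One simple way to check an estimate against a traditional source over time is to measure its average error in successive windows and flag the windows where that error grows. The sketch below is only an illustration under assumed names (`windowed_bias`, `flag_drift`) and made-up numbers. It is not a procedure taken from the paper.

```python
def windowed_bias(estimate, reference, window):
    """Mean of (estimate - reference) over consecutive non-overlapping windows."""
    biases = []
    for start in range(0, len(reference) - window + 1, window):
        diffs = [estimate[i] - reference[i] for i in range(start, start + window)]
        biases.append(sum(diffs) / window)
    return biases


def flag_drift(biases, tolerance):
    """Indices of windows whose mean error exceeds the tolerance in size."""
    return [i for i, b in enumerate(biases) if abs(b) > tolerance]


# Made-up weekly values: the estimate tracks the reference at first,
# then drifts upwards in later windows.
reference = [2.0] * 12
estimate = [2.0, 2.1, 1.9, 2.0,
            2.4, 2.5, 2.6, 2.5,
            3.9, 4.1, 4.0, 4.0]

biases = windowed_bias(estimate, reference, window=4)
drifting = flag_drift(biases, tolerance=0.3)
```

The window length and tolerance are arbitrary here. In practice they would depend on how noisy the reference data is and how quickly it becomes available.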
For Lazer this raises another issue – the segregation between traditional applied statistics and the more computer science-orientated Big Data fields.
“If you drew a network diagram of how much interaction and collaboration there is you would see two very distinct communities,” he said. “Universities are trying somehow to create a synergy, but it’s tough because who gets control?”
Breaking down these barriers is one of the major challenges for the data community, according to Lazer, because only by combining insights from both fields will researchers be able to get a more complete understanding of the world.
“Sometimes a very small high quality sample will just be way better, sometimes it will be the other way round,” he said. “But I think often a synergy will be the best way forward.”