Beware of the gaps in Big Data
As we entrust ever more of our lives to ‘big data’, how can we protect against the gaps and mistaken assumptions used to handle the information?
When the municipal authority in charge of Boston, Massachusetts, was looking for a smarter way to find which roads it needed to repair, it hit on the idea of crowdsourcing the data. The authority released a mobile app called Street Bump in 2011 that employed an elegantly simple idea: use a smartphone’s accelerometer to detect jolts as cars go over potholes and look up the location using the Global Positioning System. But the approach ran into a pothole of its own. The system reported a disproportionate number of potholes in wealthier neighbourhoods. It turned out it was oversampling the younger, more affluent citizens who were digitally clued up enough to download and use the app in the first place. The city reacted quickly, but the incident shows how easy it is to develop a system that can handle large quantities of data but which, through its own design, is still unlikely to have enough data to work as planned.
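Street Bump’s core logic can be sketched in a few lines: flag a jolt whenever vertical acceleration deviates sharply from gravity, and record the GPS fix where it happened. This is an illustrative reconstruction, not the app’s actual code; the threshold value and the function name are assumptions.

```python
# Hypothetical sketch of Street Bump-style pothole detection.
# The threshold is an assumed tuning value, not the app's real parameter.

GRAVITY = 9.8         # m/s^2, baseline vertical acceleration
JOLT_THRESHOLD = 3.0  # deviation from gravity that counts as a "bump"

def detect_potholes(samples):
    """samples: list of (vertical_accel_m_s2, lat, lon) readings.
    Returns the GPS fixes where the accelerometer registered a jolt."""
    return [(lat, lon) for accel, lat, lon in samples
            if abs(accel - GRAVITY) > JOLT_THRESHOLD]

# A short simulated drive: smooth road, smooth road, pothole.
ride = [(9.8, 42.36, -71.06), (9.9, 42.36, -71.06), (14.2, 42.37, -71.07)]
print(detect_potholes(ride))  # only the third reading is flagged
```

Note that nothing in this logic is wrong per se; the bias Boston encountered came entirely from *who* was running it, which no amount of tuning the threshold can fix.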
As we entrust more of our lives to big data analytics, automation problems like this could become increasingly common, with their errors difficult to spot after the fact. Systems that ‘feel like they work’ are where the trouble starts.
Harvard University professor Gary King, who is also founder of social media analytics company Crimson Hexagon, recalls a project that used social media to predict unemployment. The model was built by correlating US unemployment figures with the frequency with which people used words like ‘jobs’, ‘unemployment’ and ‘classifieds’. A sudden spike convinced researchers they had predicted a big rise in joblessness, but it turned out Steve Jobs had died and their model was simply picking up posts with his name. “This was an example of really bad analytics and it’s even worse because it’s the kind of thing that feels like it should work and does work a little bit,” says King.
Big data can shed light on areas with historic information deficits, and systems that seem to automatically highlight the best course of action can be seductive for executives and officials. “In the vacuum of no decision any decision is attractive,” says Jim Adler, head of data at Toyota Research Institute in Palo Alto. “Policymakers will say, ‘there’s a decision here, let’s take it’, without really looking at what led to it. Was the data trustworthy, clean?”
Earlier this year a survey of executives by the Chartered Institute of Management Accountants found that although 37 per cent said big data had helped them make decisions, 32 per cent said it had actually made things worse and 80 per cent said a strategic decision was made based on flawed information at least once in the last three years. JP Morgan’s $6bn London Whale trading loss - partly the result of Excel errors in a financial model - demonstrates how costly the flaws of misinterpreted data can be.
Inefficient business decisions are a concern, but a bigger worry is similar problems bleeding into areas like civic governance, healthcare and employment: all areas that are increasingly relying on data-led techniques. A landmark paper in 2001 showed that legalising abortion reduced crime rates - a conclusion with major policy implications. But in 2005 two economists at the Federal Reserve Bank of Boston showed the correlation was due to a coding error in the model and a sampling mistake. The example pre-dates the era of what we now call big data - the problems are not new - but it shows the dangers of building policy on flawed quantitative models of society.
Matthew Zook, a professor of geography at the University of Kentucky specialising in data-focused research, says many big-data techniques began in natural science and engineering. This is a background that doesn’t always confront people with the social origin or impact of data.
“A lot of the stuff I see is treating big data about society as if it’s a natural phenomenon,” Zook notes. He’s keen not to pigeonhole all data scientists - but he worries that an overly mechanistic understanding leads to overstating what’s being measured, or overconfidence that it means only one thing. A tweet “could be seen as approval, it could be seen as disgust, or it could be tied up with some sort of social performance”, he says. In 140 characters it’s hard to be sure.
Smart cities concern Zook. The visibility of demographic groups on the Internet - the primary source of big data - is highly variable, as exemplified by Street Bump in Boston. This is well-known, says Zook, but nuance is often lost as techniques spread from scientists to policymakers and he is worried that people will take results at face value. This could see public resources pushed towards easily measured groups at the expense of the less easily measured. And it’s not only people that are hard to measure. “Once something gets measured and put into a model then everything starts optimising around that metric. If you don’t have something that’s easily quantifiable it won’t end up in the model - things like dignity, happiness, social responsibility,” he adds.
If not taken into account, unwitting biases both from the sources of data and their processing can lead to bad decisions. As noted in a 2014 US White House report on big data, at its worst the error can “deliberately or inadvertently perpetuate, exacerbate or mask discrimination”.
People may fall victim to overly simplified models that find correlations in data that unjustifiably tar people with the same brush. A report compiled by the US Federal Trade Commission mentions a credit card company that rated consumers’ credit risk based on whether they had used marriage counselling, therapy or tyre-repair services - because analysis found these items cropping up more often among customers with poor repayment histories than among those with good records. Such correlations may be justifiable to a company’s bottom line, but will unfairly penalise many along the way even if the correlations are not flukes.
The focus on the bottom line leads to other business decisions that are difficult to justify socially. An investigation by the Wall Street Journal in 2012 found that online retailers in the US were displaying higher prices to customers in areas with less competition from brick and mortar stores. These tended to be poorer areas.
These cases illustrate the problem of using variables that act as proxies for demographic categories such as income, race or age that are not directly available.
“That proxy variable may carry historical discrimination that’s not obvious to the person using the data,” says Mark Andrejevic, an associate professor of media studies at Pomona College and author of the book ‘Infoglut: How Too Much Information Is Changing the Way We Think and Know’.
The problem is exacerbated by automated data mining. An algorithm may not be designed to discriminate, but it may base decisions on variables strongly correlated with things such as race. A 2013 study found that Google was 25 per cent more likely to serve adverts for criminal-record checks when users searched for a typically black name.
Daniel McFarland, a professor at Stanford focusing on data-based sociology, says those using big data - particularly when it is about people - must remember it is often ‘found data’ skimmed from the internet rather than the carefully sampled representative datasets of the social sciences. The rates at which people use websites, whether an IP address represents one person or an entire office or how a website’s user interface guides interaction can all skew samples.
Found data isn’t unusable, he says; the limitations just need to be taken into account when designing analyses. “Big data is a little blunt so we need to be careful and think about the implications rather than just assuming it’s representative,” he says. Another key issue is that significance - a key statistical measure of validity in many disciplines - increases with sample size. “All variables will show significance with a large enough sample,” says McFarland. For big data studies, he says, more weight should be given to effect magnitude and relative size.
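McFarland’s point is easy to verify: the t-statistic for testing a Pearson correlation grows with the square root of the sample size, so even a negligible correlation becomes ‘statistically significant’ given enough data. A minimal sketch (the correlation value of 0.02 is an arbitrary illustration):

```python
import math

def t_stat(r, n):
    """t-statistic for testing whether a Pearson correlation r,
    estimated from n samples, differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.02  # a tiny, practically meaningless correlation
for n in (1_000, 100_000, 10_000_000):
    t = t_stat(r, n)
    # |t| > 1.96 corresponds to p < 0.05 (two-tailed, large n)
    print(f"n={n:>10,}  t={t:6.2f}  'significant': {abs(t) > 1.96}")
```

With a thousand samples the correlation looks like noise; with a hundred thousand it clears the conventional significance bar, despite the effect being just as trivial - which is why McFarland argues for weighing effect magnitude, not just p-values.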
Social data is also not static. Workplace data analysts Evolv discovered in 2013 that people who kept their browsers up to date performed better at their jobs, but as soon as the metric was widely known it became useless as job-hunters altered their behaviour. This highlights another problem - a focus on correlation rather than explanation. This is often the result of trying to answer a narrow, results-focused question rather than comprehending the broader social systems at play. “My sense is that we’re more into an age of engineering where we chase quick results and solutions without really understanding the phenomenon under the hood,” says McFarland.
Andrejevic agrees: “The deeper approach to the world of trying to understand rather than just correlate falls by the wayside, because in the end all these big data directions push towards correlation, eclipsing explanation,” he says. “But explanation’s an important element of the human experience and one that has longer-term pay-offs than the purely correlational, predictive approach.”
Perils of popularity
This lack of understanding could be intensified by the growing popularity of deep learning, which uses artificial neural networks to automatically glean insight from massive datasets. The features they learn are not easily decipherable by people - the networks can tune into commonalities between images and text that humans will often ignore, which can lead to unexpected failures.
A 2014 paper by Google and New York University found pairs of images of objects that are almost identical to the human eye would be classified completely differently by the neural network. The network was apparently homing in on tiny discrepancies that were found to be important in the images used for training.
Other kinds of algorithms, such as decision trees, provide more readily interpretable processes, says Toyota’s Adler, but the power of deep-learning systems is seductive. Research into ways to decode their decisions is under way, but still formative. Adler says one solution is to break systems into multiple deep-learning components working on smaller, more tractable problems. “You don’t want to be in a position where you have to say, ‘the computer said so’,” he adds.
But the complexity of some analytics systems means this is often the reality for all but the most skilled data scientist. Thinking about a correlation over hundreds of variables and millions of data points is like “trying to visualise 10-dimensional space”, says Andrejevic. “Because these processes are complex and opaque, because the technology evolves so fast, it’s difficult to subject it to public deliberation,” he adds.
Bringing statistics to bear on the results of big data analytics can help weed out bias, as happened with the Street Bump app. Last year computer scientists at the University of Utah, University of Arizona and Haverford College created an algorithm that tests datasets to see how easily protected categories such as race can be determined from proxy variables and another algorithm that automatically repairs datasets to prevent this from happening.
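The intuition behind the proxy-testing approach can be illustrated with a much simpler stand-in: within each value of a suspected proxy variable, predict the majority protected category and measure overall accuracy. If the proxy predicts the protected attribute far better than chance, decisions based on it will inherit the demographic skew. This is a simplified illustration with assumed field names, not the researchers’ published algorithm.

```python
from collections import Counter, defaultdict

def proxy_leakage(records, proxy_key, protected_key):
    """Crude leakage test: predict the majority protected value within
    each proxy group and report overall accuracy. High accuracy means
    the proxy variable reveals the protected category.
    (A toy stand-in for the published test, not the authors' code.)"""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[proxy_key]][rec[protected_key]] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(records)

# Illustrative toy data: postcode largely determines group membership.
data = [
    {"postcode": "A", "group": "x"}, {"postcode": "A", "group": "x"},
    {"postcode": "A", "group": "y"}, {"postcode": "B", "group": "y"},
    {"postcode": "B", "group": "y"}, {"postcode": "B", "group": "x"},
]
print(proxy_leakage(data, "postcode", "group"))  # 4 of 6 predicted correctly
```

A result well above the base rate of the commonest group is the warning sign; the repair algorithms the researchers describe then adjust the dataset until this kind of prediction is no better than chance.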
However, as decisions increasingly rely on analytics and automation this task becomes ever more daunting, says Andrejevic. “What I guess we need are automated systems of accountability checking. But you can’t automate everything. You end up with a process checking on a process checking on another process. Where does the buck stop?”
The problem may be exacerbated by the growing trend for big data tools to be put into hands far beyond those of trained data scientists. Based on the idea that increasingly advanced tools now allow anyone to perform analyses that were previously the remit of experts, 2016 has been declared the year of the ‘citizen data scientist’. As well as tackling a shortage of data scientists, some believe the practice will increase transparency.
Most experts are sceptical, though keen not to discourage these enthusiastic amateurs. Harvard’s King says they’re unlikely to be entrusted with enough to do much damage, but could contribute to the scientific process at the heart of big data. Even if their analyses are wrong they may put together interesting datasets or ask new questions.
As with all statistics, there is a danger of people twisting data to support an agenda, says Kentucky’s Zook. “Saying I ran a big data analysis on this issue is very powerful rhetorically.” But he thinks the ability of citizens to challenge official analyses can help balance potential misuse. The increasing adoption of open data by technology companies and governments makes this increasingly feasible.
But big data analysis is resource-intensive and citizen scientists will always be outgunned by large institutions. There may be little help from academia, says Stanford’s McFarland, as funding is aimed at new discoveries not validating previous work. He’d like data science to follow the lead of psychology, which has recently focused on retesting old research and finding much is not replicable, but he’s not optimistic. “Everything is moving so quickly we’ve all moved on by the time we realise something is not right,” he says.
Everyone E&T spoke to cautioned against throwing out the baby with the bathwater. Big data’s potential benefits to society are greater than the risks. And as much as it can incorporate bias it can also help challenge it, whether that’s Google using it to tackle diversity issues in its workforce or companies like LexisNexis creating alternative credit scores that help the traditionally excluded to access credit.
The data needs to be collected and handled in a way that tries to avoid biases that are sometimes hard to foresee. In 1952 the Boston Symphony Orchestra initiated blind auditions to help diversify its male-dominated roster, but trials still skewed heavily towards men. After musicians removed their shoes, nearly 50 per cent of the women cleared the first audition. It turned out the sound of their high heels had been biasing judges subconsciously. “Learning you’re biased doesn’t change your decisions,” says Harvard’s King. “It needs a procedure.”
Without solid guidelines or policy on these issues, data scientists and engineers are being left to make calls outside their remit. “They are hungry for guidance in this area,” says Toyota’s Adler. He says the “geeks, suits and wonks” have been used to operating sequentially: geeks create technology, suits make it successful and wonks manage the repercussions. But the pace of progress is pushing their worlds together, he says. “It’s not serial any more. Everyone needs to come together at the same time. They need to learn each other’s vocabulary.”
Google flu tracker shows symptoms of unreliable data analytics
The poster boy for big data analytics failures is undoubtedly Google Flu Trends (GFT). Launched in 2008, the service promised near real-time tracking of flu prevalence based on analysis of Google searches for flu-related information.
It initially performed well, producing accurate estimates two weeks ahead of the US Centers for Disease Control (CDC). But GFT’s figures began to diverge significantly from the CDC data in 2012, peaking in late summer (see graph). In 2013 the system made the headlines when it became clear it had been predicting more than twice as many doctor visits as the CDC.
In a 2014 paper, Professor David Lazer of Northeastern University in Boston, Massachusetts, attributed this growing inaccuracy to a failure to take account of changing search dynamics - due to changes both in the search algorithms and in how consumers used Google.
Lazer warned that Twitter and Facebook similarly re-engineered their algorithms constantly, raising important issues for those analysing social-media data. “It’s not a scientific instrument that’s meant to collect information about the state of the world,” he said. “It is being repurposed for something it’s not designed for.”
Google stopped publishing Flu Trends in 2015, but still makes the search-signal data available to medical researchers.