Learning to find the right data
Getting good information into AI models is not easy.
It’s perhaps an indication of how much the pendulum has swung towards machine learning within the artificial-intelligence community that the concept of 'data-centric AI' seems practically a tautology.
Back in the dim distant past, otherwise known as the late 20th century, much of the work in AI focused on building systems from scratch that could reason for themselves about the world. Then along came deep learning and, though there are people still working on the reasoning-heavy style of AI, most of the attention has gone into the approach of showing computers pictures or descriptions of stuff and expecting them to learn to identify the stuff.
In 2012, Geoffrey Hinton, professor of computer science at the University of Toronto, and colleagues demonstrated a rapid advance in the ability of computers, thanks to the increased calculating power of GPUs, to label correctly objects from a thousand categories in the ImageNet dataset. Their deep neural network (DNN) easily outperformed the types of AI algorithms that dominated the two prior annual ImageNet challenges. Around the same time, a group working at the IDSIA research institute in Switzerland used a DNN to outperform humans on a similar task: recognising road signs. How so? The computer-based system was able to use subtle clues in shape and size to come up with answers for signs that had most of the original image bleached out by the Sun.
Since then, claims of machine outperformance have popped up at regular intervals, interspersed with evidence of how DNNs cheat and are, in turn, easily cheated, often for similar reasons. As with the model trained on road signs, the neural networks frequently home in on details missed by human observers, not least because the human brain does not perceive images at the same level of detail as a computer fed a PNG. Subtle textures can be as helpful to the machine as anything: a number of studies have pointed out that DNNs do not yet do a great job of pulling important features out of an image and associating them with a particular object. Much of the time, they alight on things humans would largely ignore when asked 'is there a computer in the picture?' or 'is the activity being shown one of cooking?'.
Several years ago, a group working at the University of Virginia noticed how DNNs would, perhaps not unexpectedly, put a lot more weight on things that turned up together more commonly in scenes than others. And those things would often correlate with stereotypes, largely because the images used to train the models were sourced from publicly available image databases, often with the help of search engines. So, a dataset might contain twice as many images of women cooking as of men cooking, and a model trained on it would use that correlation to decide what it saw when shown another image.
The result? The machine would make mistakes on who was cooking or inadvertently come up with the wrong answer based on the apparent gender of the person in the shot. This kind of 'directional bias' is one source of the problems that DNNs have when presented with real-world data and also helps identify a big problem with the current generation of machine-learning systems: the data they use to train is not good enough.
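The kind of skew described above can be measured in the labels before any model is trained. The following is a minimal sketch, not the University of Virginia group's method: the `annotations` list and the `cooccurrence_share` function are hypothetical, invented purely to illustrate a two-to-one imbalance of the sort mentioned.

```python
from collections import Counter

# Hypothetical annotations: (activity, apparent gender of the person shown).
# The counts are made up to mirror a two-to-one skew.
annotations = [
    ("cooking", "woman"), ("cooking", "woman"), ("cooking", "man"),
    ("cooking", "woman"), ("cooking", "woman"), ("cooking", "man"),
    ("repairing", "man"), ("repairing", "man"), ("repairing", "woman"),
]


def cooccurrence_share(pairs, activity):
    """Return each group's share of the images labelled with `activity`."""
    counts = Counter(group for act, group in pairs if act == activity)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


shares = cooccurrence_share(annotations, "cooking")
print(shares)
```

A share far from an even split is a warning that a model may learn the correlation rather than the activity itself.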
Very often, to get millions of images or other bits of content into the system and have them labelled, researchers have turned to crowdsourcing services such as Amazon's Mechanical Turk and Upwork. But the use of relatively cheap labour comes with hidden costs, not least the slurs and insults that sometimes pop up in the labels attached by less-than-happy or poorly trained crowdsourcers.
Then you have gaps in the data itself. In a talk for the Center for Information Technology Policy at Princeton University last year, Olga Russakovsky, assistant professor of computer science at the same university, described how the western focus of many public datasets leads to mistakes in recognising things as simple as soap. Bars of soap are relatively rare in the US compared to liquid soap, so models, Russakovsky said, can fail to recognise them as soap. “A lot of these issues can be traced back to the fact that we're collecting all of this data primarily from the web because it's the cheapest most readily available large-scale source of data,” she added.
At a conference on data-centric AI organised by Stanford University in November last year, Cody Coleman, PhD candidate at the university, noted: “The unprecedented amount of available data has been critical to many of deep learning’s recent successes. However, big data brings its own problems. It's computationally demanding, resource-hungry and often redundant. But when we think about real-world data sets, they often skew to a small number of common or popular classes.”
The data-centric AI movement aims to solve this by paying a lot more attention to the data used to train the model, trying not just to avoid wasted effort but also to avoid skewing the results by presenting too many sources that represent more or less the same thing. One approach is to make machine learning a lot more iterative, with the data and model tuned repeatedly to try to reduce errors. The question is how much of this can be automated. One example of a way forward is DCBench, which looks for signs of gaps or bias in the trained model and the data used to feed it, and uses that to identify ways to fix the problem.
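One diagnostic step in such an iterative loop is to break the model's errors down by slice of the data and see where it struggles. The sketch below is an illustration only and does not use DCBench or its API: the `records` list, the slice names and the `error_rate_by_slice` function are all hypothetical, borrowing the soap example purely for flavour.

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice name, true label, predicted label).
records = [
    ("liquid soap", "soap", "soap"),
    ("liquid soap", "soap", "soap"),
    ("liquid soap", "soap", "soap"),
    ("bar soap", "soap", "food"),
    ("bar soap", "soap", "soap"),
    ("bar soap", "soap", "stone"),
]


def error_rate_by_slice(rows):
    """Return the fraction of wrong predictions within each slice."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, truth, predicted in rows:
        totals[slice_name] += 1
        if truth != predicted:
            errors[slice_name] += 1
    return {name: errors[name] / totals[name] for name in totals}


rates = error_rate_by_slice(records)
# The slice with the highest error rate is the first candidate for more data.
worst = max(rates, key=rates.get)
print(worst, rates[worst])
```

In practice, the worst-performing slice would point to where more or better examples should be collected before the model is retrained and checked again.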
At the NeurIPS conference at the end of 2021, a team from Salesforce Research described a 'human in the loop' semi-automated approach to weed out problems in the training data and come up with additional rules the model could use. They found that more conventional deep-learning approaches, such as using adversarial data to try to get the model to learn the right patterns itself, were more costly than simply building rules into the model directly.
A decade after DNNs seemed to have offered the kiss of death to rule-based AI, it’s making a partially hidden return. The term 'data-centric AI' may turn out to be a bit of a misnomer in the end as model designers put more tweaks into their engines to deal with the problems caused by relying too much on the data itself.