MIT nixes training dataset containing thousands of racist and sexist slurs
MIT has removed, and apologised for, a popular dataset used to train AI systems to describe the people and objects in images, after the first close examination of the dataset found that women and ethnic minorities had been labelled with derogatory terms.
The dataset ('80 Million Tiny Images') was taken offline after a study found that many of its images were labelled with racist, misogynistic and otherwise offensive terms; it had not previously been scrutinised in this way.
The dataset was introduced in 2008 and contains almost 80 million 32×32-pixel photos (small enough for late-2000s and early-2010s image-recognition algorithms to process), scraped from search engines and sorted into 75,000 categories.
Although this dataset has since been succeeded by better-known, higher-resolution collections such as ImageNet, it has nevertheless been used widely to teach machine-learning systems to identify and describe the people and objects depicted in an image. According to Google Scholar, the dataset has been cited more than 1,700 times by researchers.
A recent study by PhD candidates Vinay Prabhu and Abeba Birhane found that the labels in the dataset included terms like “rape suspect”, “paedophile”, “child molester”, “whore” and “bitch”, as well as unnecessary expletives. Nearly 2,000 images were labelled with the n-word, including images of monkeys, and some images were non-consensual softcore pornography, such as upskirt photos and voyeuristic beach photography.
The researchers laid out their findings in a paper (“Large image datasets: a pyrrhic win for computer vision?”) that is currently under peer review for a 2021 computer-vision conference. They warned that neglecting to fix these troubling datasets can lead to real harm when tools trained on them are deployed in the real world.
“The absence of critical engagement with canonical datasets disproportionately negatively impacts women, racial and ethnic minorities and vulnerable individuals and communities at the margins of society,” they concluded.
MIT has apologised and removed the dataset from its website, explaining that it had scraped the images without checking whether exploitative, explicit or otherwise inappropriate images had been collected. It stated that the images are so small that manual inspection, even if feasible, would not guarantee the removal of offensive images and terms.
“[The dataset] has been taken offline and it will not be put online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded,” the statement said.
Professor Antonio Torralba, an MIT computer science expert, told The Register that the dataset had been constructed by obtaining a list of 53,464 nouns from WordNet (a list that included the derogatory terms), then using those words as queries to scrape images from the web and merging the results into a single collection.
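The pipeline Torralba describes can be sketched in a few lines. This is an illustrative reconstruction, not MIT's actual code: `fetch_image_urls` is a hypothetical stand-in for a search-engine scraper, and the three-noun list stands in for WordNet's 53,464 nouns. The key point it illustrates is that every query word, including any slur present in the word list, becomes a category label attached directly to the scraped images, with no filtering step in between.

```python
def fetch_image_urls(query, limit=5):
    # Hypothetical placeholder: a real pipeline would query a
    # search-engine image API here and return result URLs.
    return [f"https://images.example/{query}/{i}.jpg" for i in range(limit)]

def build_image_index(nouns, per_noun=5):
    # Each noun becomes a category label; whatever the search engine
    # returns for that word is filed under it unchecked.
    index = {}
    for noun in nouns:
        index[noun] = fetch_image_urls(noun, limit=per_noun)
    return index

# Tiny stand-in for the 53,464-noun WordNet list.
nouns = ["apple", "bicycle", "lighthouse"]
index = build_image_index(nouns)
print(len(index), len(index["apple"]))  # → 3 5
```

Because the labels are inherited verbatim from the query list, cleaning the dataset after the fact requires auditing the word list itself, which is what the researchers argue was never done.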
The harm associated with biased datasets and practices has been thrust back into the limelight by the movement against police use of facial recognition tools. AI researchers have demonstrated that commercial facial recognition tools perform poorly at identifying women and people with darker skin tones.
This week, the head of the Detroit Police Department told reporters that facial recognition tools identify the wrong person “96 per cent of the time”, after a black man, Robert Williams, became the first person known to be wrongfully arrested on the basis of a facial recognition match by Detroit police.