Commercial AI systems for facial recognition fail women and darker-skinned people, study finds
Image credit: Bryce Vickmark
A Massachusetts Institute of Technology study has found that commercial facial-recognition software can come with in-built racial and gender biases, misclassifying the gender of the darkest-skinned women in nearly half of cases.
Joy Buolamwini, an MIT researcher and first author on a paper reporting the project's results, was inspired to carry out the research while working on an application for a facial-analysis program, which she noticed did not work reliably with dark-skinned subjects. Investigating further, she found that commercial facial-recognition software often did not identify her gender correctly, and sometimes failed to recognise her face as a human face altogether.
According to Buolamwini, one US tech company claimed that its facial-recognition system had an accuracy rate of more than 97 per cent, though that figure was obtained by testing on a data set that was 77 per cent male and 83 per cent Caucasian.
To assess the extent of bias in these systems, she assembled a dataset of more than 1,200 photographs in which women and non-white faces were better represented than in standard benchmarks. She graded each subject's skin colour from 'light' to 'dark' using the Fitzpatrick scale of skin tones, a six-point scale dermatologists use to assess risk of sunburn. She then ran three commercial facial-analysis systems – intended to match faces across photos and to estimate binary gender, age and mood – on the dataset, recording how each classified the gender of the faces.
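The audit procedure described above can be sketched in a few lines of Python. Everything here is illustrative: `classify_gender` stands in for whatever API a vendor exposes, and the binning of Fitzpatrick types into 'lighter' and 'darker' follows the study's broad grouping, not its actual code.

```python
# Sketch of a disaggregated audit: group a labelled benchmark by gender
# and binned skin type, query a (hypothetical) gender classifier, and
# tally correct/total per subgroup.
from collections import defaultdict

def fitzpatrick_bin(ftype):
    """Collapse the six Fitzpatrick types (1-6) into a 'lighter'
    (types 1-3) vs 'darker' (types 4-6) binary."""
    return "lighter" if ftype <= 3 else "darker"

def audit(benchmark, classify_gender):
    """benchmark: iterable of (image, true_gender, fitzpatrick_type).
    Returns {(gender, tone_bin): [n_correct, n_total]}."""
    tallies = defaultdict(lambda: [0, 0])
    for image, gender, ftype in benchmark:
        group = (gender, fitzpatrick_bin(ftype))
        tallies[group][0] += int(classify_gender(image) == gender)
        tallies[group][1] += 1
    return dict(tallies)
```

The point of keeping tallies per intersectional subgroup, rather than one aggregate accuracy, is exactly the study's methodological lesson: a single headline number can hide large failures on under-represented groups.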
She found that all of the systems were more likely to misgender female and darker-skinned subjects. Across the three programs, the error rate in determining the sex of light-skinned men never exceeded 0.8 per cent, while the error rates for dark-skinned women were 20.8 per cent, 34.5 per cent and 34.7 per cent.
For the darkest-skinned women, the systems were so poor at determining gender that they might as well have been guessing at random, with error rates of 46.5 to 46.8 per cent.
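The arithmetic behind the 'guessing at random' comparison can be made concrete. A binary classifier that flips a coin errs about 50 per cent of the time, so a subgroup error rate in the high forties is practically indistinguishable from chance. The tallies below are hypothetical numbers chosen to reproduce the headline rates, not the study's raw data.

```python
# Turn per-subgroup correct/total tallies into error rates and flag any
# subgroup whose error rate is close to the 50% coin-flip baseline.
def error_rate(correct, total):
    return 1 - correct / total

def near_random(rate, tolerance=0.05):
    """A binary classifier guessing at random errs ~50% of the time."""
    return abs(rate - 0.5) <= tolerance

# Hypothetical tallies for one system, scaled to match the reported rates.
tallies = {
    ("male", "lighter"): (992, 1000),   # 0.8% error
    ("female", "darker"): (532, 1000),  # 46.8% error
}
for group, (correct, total) in tallies.items():
    rate = error_rate(correct, total)
    flag = "near-random" if near_random(rate) else "well above chance"
    print(group, round(rate, 3), flag)
```

On these numbers, the light-skinned-male subgroup is flagged as performing far better than chance, while the darkest-skinned-female subgroup falls within the coin-flip band.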
“To fail on one in three, in a commercial system on something that’s been reduced to a binary classification task, you have to ask: would that have been permitted if those failure rates were in a different subgroup?” Buolamwini said.
“The other big lesson […] is that our benchmarks, the standards by which we measure success, themselves can give us a false sense of progress.”
The study raises questions about how neural networks – trained on huge data sets, often with human input – can reflect human biases.
“What’s really important here is the method and how that method applies to other applications,” said Buolamwini. “The same data-centric techniques that can be used to try to determine somebody’s gender are also used to identify a person when you’re looking for a criminal suspect or to unlock your phone.”
“I’m really hopeful that this will spur more work into looking at [other] disparities.”