
New AI model brings African languages to the forefront of natural language processing
Researchers have developed an AI model to help computers work more efficiently with a wider variety of languages – extending natural language processing (NLP) capabilities to African languages that are heavily underrepresented in AI.
African languages have received little attention from computer scientists, so few NLP capabilities have been available to large swaths of the continent. But a novel language model, developed by researchers at the University of Waterloo in Canada, fills that gap by enabling computers to analyse text in African languages for many useful tasks.
The new neural-network model, which the researchers have dubbed AfriBERTa, uses deep-learning techniques to achieve “state-of-the-art” results for low-resource languages, according to the team.
It works specifically with 11 African languages, including Amharic, Hausa and Swahili, which are collectively spoken by more than 400 million people. According to the researchers, it achieves output quality comparable to the best existing models despite learning from just one gigabyte of text, whereas other models require thousands of times more data.
“Pre-trained language models have transformed the way computers process and analyse textual data for tasks ranging from machine translation to question answering,” said Kelechi Ogueji, a master’s student in computer science at Waterloo. “Sadly, African languages have received little attention from the research community.
“One challenge is that neural networks are bewilderingly text- and computer-intensive to build. And unlike English, which has enormous quantities of available text, most of the 7,000 or so languages spoken worldwide can be characterised as low-resource, in that there is a lack of data available to feed data-hungry neural networks.”
According to the researchers, most of these models learn through a technique known as pre-training. To accomplish this, the researchers presented the model with text in which some words had been covered up, or masked.
The model then had to guess the masked words. By repeating this process many billions of times, the model learns the statistical associations between words, which mimic human knowledge of the language.
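In practice, a model pre-trained this way can be queried directly to fill in a hidden word. The sketch below, using the Hugging Face `transformers` library, illustrates the masked-word task described above; the checkpoint name and the Swahili example sentence are assumptions for illustration, not details taken from the research.

```python
# Minimal sketch of masked-word prediction with a pre-trained language model.
# The model identifier below is an assumption for illustration; substitute
# whichever pre-trained masked language model checkpoint you have available.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="castorini/afriberta_large")  # assumed model ID
mask = fill_mask.tokenizer.mask_token  # e.g. "<mask>" or "[MASK]", depending on the model

# Swahili sentence with one word hidden; the model ranks its guesses.
sentence = f"Mji mkuu wa Kenya ni {mask}."  # "The capital of Kenya is [MASK]."
for guess in fill_mask(sentence):
    print(f"{guess['token_str']:>12}  probability={guess['score']:.3f}")
```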
“Being able to pre-train models that are just as accurate for certain downstream tasks, but using vastly smaller amounts of data, has many advantages,” said Jimmy Lin, holder of the Cheriton Chair at the University of Waterloo’s David R. Cheriton School of Computer Science.
“Needing less data to train the language model means that less computation is required and consequently lower carbon emissions associated with operating massive data centres,” he added. “Smaller datasets also make data curation more practical, which is one approach to reduce the biases present in the models.”
Lin believes the research and the model take a “small but important step” towards bringing natural language processing capabilities to the more than 1.3 billion people on the African continent.