Mind the gap: Building inclusive AI for African languages

As AI tools such as large language models and speech-enabled applications become increasingly pervasive and their capabilities grow, it is important to constantly take stock of where there may be gaps, consider the populations that are left out or negatively impacted by these gaps, and subsequently take intentional strides towards filling them.
Language diversity, especially in state-of-the-art language tools, is still quite poor. African languages fall within a broader categorization of languages known as 'low resource': languages for which there is little data available online that can be used to train models. The label reflects data scarcity, not the living populations that speak these languages every day.
When efforts to build language tooling for African languages were still quite nascent, there was far less data available for these languages than there is now. Consider the task of machine translation, the automatic translation of text from one language to another. Most of the parallel data that did exist (data in one language alongside its aligned, direct translation in a second language) came from the religious domain: translations of the Bible or other religious texts. These translations were done by religious organizations whose primary aim was evangelization, in both pre-colonial and post-colonial times.

But the use of this content to train machine translation models presented a challenge. When most of the training data used to develop a model comes from a single domain, in this case the religious domain, the resulting system will not perform well when faced with translation from other domains, such as news or everyday conversations.
Below are some examples of English to Kiswahili translations done by a machine translation system that was trained only on religious content. We find that certain words likely to be used in the Bible are arbitrarily inserted in various contexts.
[These translations are far from perfect; we simply use them to demonstrate that these words/concepts common in a religious setting are learned erroneously]
'Waumini' is 'believers' in Kiswahili.
- [original sentence] I'm Canadian, and I'm the youngest of seven kids.
- [MT system output] Mwamko wa Mkanada, na Waumini ni kitinda-mimba cha watoto saba.
- [direct translation of system output] Of Canadian origin, and believers in the last born of seven children.
'Kwa ukunjufu' translates to 'with joy'.
- [original sentence] Now you've seen before that with electrodes you can control robotic arms…
- [MT system output] Sasa kwa ukunjufu waonekana kabla ya hapo kwa kutumia elektrodi waweza kudhibiti mikono ya roboti…
- [direct translation of system output] Now with joy they are seen before here through the use of electrodes you can control the robotic hands…
'Ulozi' is 'witchcraft' in Kiswahili.
- [original sentence] It just says "Money," and if you look down the side street, it says, "does not make me happy."
- [MT system output] Husema tu Pesa, ulozi, na ukiangalia upande wa barabara, hiyo husema, ulozi hauleti furaha.
- [direct translation of system output] It just says money, witchcraft, and if you look on the side of the road, that one says witchcraft does not bring happiness.
Yet even this flawed inclusion has been welcome in light of the seemingly absolute dominance of major Western languages, not only in research but subsequently in the language support of end-user tooling. Surveys of the languages represented in major NLP conferences, going back as early as 2004, demonstrate extremely poor language diversity, with English sometimes accounting for up to 90% of the work. Chinese and German are, respectively, the second and third most represented languages.
This focus on some languages over others is a proxy for language representation in the datasets that are available. Building datasets is an expensive and time-consuming endeavor. Often, researchers interested in building tools will not invest in building datasets and will instead build for the languages where some data already exists.
The current trends, where dataset building is concerned, involve crawling the internet for all available data and then curating subsets of these and packaging them into datasets. The process of curation is viewed as a research endeavor in and of itself and subsequently, other groups of researchers use the curated datasets to train models and develop language tools.

Common Crawl is an organization that, since 2008, has scraped content from the web and created snapshots of this content at least once a month. Their monthly dumps of web content are a popular source for many web-crawled datasets. Given this source data, a team of researchers seeking to curate a dataset for a particular task would undertake several steps.
First is language identification. Content published on the internet is in a multitude of languages and it is important to sift through each individual document to identify the language that the text is in.
There are models trained specifically for the task of language identification. Their performance varies across languages: they are likely more confident identifying English text than text in a low resource language such as Kiswahili, because far more English data has already been curated and used to train the language identification models themselves, a self-reinforcing cycle.
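To make the idea concrete, here is a deliberately tiny sketch of character-n-gram language identification. Real systems are trained on millions of documents per language; the two seed texts, the trigram profiles, and the overlap score below are all invented for illustration only.

```python
from collections import Counter

# Toy seed samples; a real language ID model learns from millions of documents.
SEED_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "sw": "habari ya asubuhi watu wote wanapenda kusoma vitabu vizuri",
}

def trigrams(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {lang: trigrams(t) for lang, t in SEED_TEXT.items()}

def identify(text):
    """Return the language whose trigram profile overlaps most with `text`."""
    grams = trigrams(text)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify("the dog and the fox"))    # matches the English profile
print(identify("watu wanapenda vitabu"))  # matches the Kiswahili profile
```

Because each language's profile is only as good as the data behind it, a language with a thin profile is misidentified more often, which is exactly the feedback loop described above.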
Next is a pre-processing step: cleaning and preparing the data for a particular task or system. In the case of machine translation, the ideal data is aligned from one language to another at the sentence level.
As part of the pre-processing, each document is therefore split into individual sentences based on punctuation, and duplicate sentences are removed. These tasks are carried out by automated scripts.
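A minimal sketch of these two pre-processing steps might look as follows. The punctuation-based split is intentionally naive; production pipelines handle abbreviations, quotes, and language-specific punctuation far more carefully.

```python
import re

def split_sentences(document):
    """Naively split a document into sentences on terminal punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

def deduplicate(sentences):
    """Drop exact duplicate sentences, keeping first occurrences in order."""
    seen = set()
    unique = []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique

doc = "Habari! Habari! Leo ni Jumatatu. Tutaonana kesho."
print(deduplicate(split_sentences(doc)))
# ['Habari!', 'Leo ni Jumatatu.', 'Tutaonana kesho.']
```

Note that punctuation-based splitting is itself a source of errors for languages whose texts use punctuation differently from English, another place where a one-size-fits-all script quietly degrades low resource data.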

Finally, for this data to be useful for the task of machine translation, there needs to be alignment between a source and target language. This alignment is done by a model that performs pattern matching and produces a probability that a sentence in one language is aligned to a sentence in a different language, by virtue of being its translation.
In a setting where a multi-lingual dataset is being curated, each English sentence might be aligned to multiple sentences in different languages.
Given a similarity score that tells us two sentences are likely to be translations of each other, a dataset can then be curated for multiple language pairs: English to Kiswahili, Kiswahili to Turkish, Turkish to Quechua, despite there potentially not being any text that was manually translated by a human from one of these languages to another.
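The alignment step can be sketched as a similarity threshold over sentence embeddings. In practice the embeddings come from a multilingual model (systems such as LASER or LaBSE are commonly used); the vectors and the cutoff value below are entirely made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: translations should land close together.
embeddings = {
    "I love books": [0.9, 0.1, 0.2],
    "Ninapenda vitabu": [0.85, 0.15, 0.25],   # Kiswahili: "I love books"
    "The weather is cold": [0.1, 0.9, 0.3],
}

THRESHOLD = 0.95  # assumed cutoff; real pipelines tune this per corpus

def aligned(src, tgt):
    """Treat two sentences as a translation pair if they are similar enough."""
    return cosine(embeddings[src], embeddings[tgt]) >= THRESHOLD

print(aligned("I love books", "Ninapenda vitabu"))     # above threshold
print(aligned("I love books", "The weather is cold"))  # below threshold
```

Everything hinges on the threshold and on how well the embedding model represents each language, which is precisely where low resource languages are short-changed.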
The result?
Well, the effort results in a higher volume of data deemed useful by the series of models used to curate it. The process may be viewed as more efficient than traditional dataset building when one considers the manpower required (a handful of researchers) and the cost put in, versus the output produced: potentially millions, even billions of sentence pairs.
Yet the quality of the data is sorely wanting, especially for the low resource languages for which this data is most crucial. A manual audit of 205 language-specific corpora drawn from web-crawled datasets found that at least 15 of these corpora had no usable text, and 87 of them had less than 50% usable text. The majority of these quality issues had not been reported or investigated in depth.
A speaker of these languages can identify these gaps fairly easily, but many technology developers will simply use this data as is to train models. How well a resulting system performs is then judged by an automatic output metric, such as accuracy, which is a statistical score.
Such a metric says nothing of what the experience for real people using these systems will be, and unfortunately a round of human evaluation is often not carried out, especially for low resource languages.
Even when there are reports of extension of capabilities to languages that are historically considered low resource, this inclusion is often superficial, with the performance being laughably poor. While it may not be marketable to do the painstaking, foundational work for meaningful inclusion of low resource languages, these efforts are what is truly needed to make a difference.

Kathleen is a researcher at the DAIR Institute. Her focus is Natural Language Processing, particularly building speech recognition technologies for African languages, notably Kiswahili. Kathleen has been involved in efforts to build language datasets, experiences which have inspired a desire to explore data governance and licensing models that address power and resource imbalances. Kathleen works with African AI communities to enable ecosystem capacity building and relevant research. She continues to organise with communities as part of the Deep Learning Indaba, where she is a trustee, and the Masakhane Research Foundation, where she is chair of the board of directors.


