Making Machines Speak Yorùbá - Imminent - Translated's Research Center

A report on the research funded by Imminent’s Grant for the collection of speech data for the Yorùbá language.

Index of content

Introduction
What was the issue to be addressed?
Project design
Challenges and innovations
Duration and result
Impact
Who are the team members?
Conclusions

Introduction

In early 2022, we were awarded a twenty-thousand-euro grant by Imminent to work on linguistic data for Yorùbá, a language spoken in southwestern Nigeria.
At the time, there was only one Yorùbá language audio corpus online that was clean, modern, and usable. It was created while I worked at Google in 2019 and made freely available as an example of the possibilities of technology to aid underserved languages in their quest for visibility on the internet and in modern technologies.
It was also one of only a few audio corpora in any African language, which highlighted one of the biggest challenges facing African languages in technology. While languages from Europe and Asia such as English, German, Korean, Mandarin, Portuguese, and Arabic had already made giant strides in artificial intelligence, machine learning, and translation, African languages continued to fall behind. Google Translate performed poorly in them. Siri and Google Assistant worked in many of those other languages but did not exist in any African language. And a speaker of any language on the African continent still depended mostly on English to access the internet, with little content in their own language to navigate the web.
But that Google corpus, as good as it was, was only five hours long, and barely able to do more than illustrate the possibilities that exist and inspire others in that direction. Having been a part of its creation, I knew how much effort and resources it took to create, as well as the limits of its use. So I always retained some hope that we could replicate the work and create a larger corpus with real, practical uses.
So when Imminent opened up its competition, we applied. And when we were selected, we got to work. This essay is about the two years it took to create a corpus of fifty hours of Yorùbá audio, the challenges we faced, the strides we took, and the possibilities that are now enabled by the result of our project, supported by this European organization.

What was the issue to be addressed?

The biggest issue was access to data. Like many African languages, Yorùbá, though spoken by about fifty million people, did not have much publicly usable audio data. That qualification is necessary because not every Yorùbá text or recording is usable for technological purposes. Yorùbá is a tonal language, so the writing needs diacritical marks to disambiguate meaning. Most people who write the language on the internet do not have the tools to add diacritics, so they write without them, and the resulting text is useless for these purposes. Most Yorùbá audio available online comes from radio or YouTube channels with no parallel transcription, and it usually includes English or other-language expressions, because many Yorùbá speakers who use technology are literate and bilingual, and often code-switch in everyday speech.
So any data meant to address the needs of monolingual speakers must itself be monolingual, and properly curated to handle diacritics, which are a core feature of Yorùbá orthography.
The task we set out to address therefore involved gathering thousands of Yorùbá texts that used proper diacritics, checking them for errors and fixing them, and then preparing them for recording by volunteers who speak Yorùbá fluently, are confident in reading curated texts, and are willing to work for a small honorarium and gift, which our grant facilitated.
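To make the diacritics problem concrete, here is a minimal Python sketch. It is not the project's actual tooling, and the function names are hypothetical. It uses Unicode normalization to detect whether a text carries tone marks (grave for low tone, acute for high tone) and shows how stripping the marks collapses properly written Yorùbá into the ambiguous form commonly typed online. Because a sentence made up only of mid tones legitimately carries no tone marks, a check like this can only flag candidates for human review.

```python
import unicodedata

# Combining grave (low tone) and combining acute (high tone).
TONE_MARKS = {"\u0300", "\u0301"}


def has_tone_marks(text: str) -> bool:
    """Return True if the text contains at least one grave or acute tone mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(ch in TONE_MARKS for ch in decomposed)


def strip_diacritics(text: str) -> str:
    """Remove every combining mark, including tone marks and the under-dot."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    for sample in ["Yorùbá", "Yoruba", "Kọ́lá Túbọ̀sún"]:
        print(sample, has_tone_marks(sample), strip_diacritics(sample))
```

Normalizing to NFD first means the check works whether the source text stores a letter such as "ù" as one precomposed character or as a base letter followed by a combining mark.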

Project design

To get data that was usable across many domains, like automatic speech recognition and text-to-speech, we needed a design that included multiple voices in multiple scenarios. The voices had to include males and females, young and old, in noisy and quiet places, and cover varied types of Yorùbá expressions, syllables, phrases, and rare sentence and tonal combinations. We sat down and designed this dataset. One part of it came from hundreds of news headlines in Yorùbá that needed to be cleaned up with appropriate diacritics, have offensive parts removed, and be rewritten to fit our sentence length plans. The other part came from sentences we deliberately crafted from scratch to satisfy all the parameters described above.

In the end, we had twenty-five thousand lines, the number we needed to reach fifty hours of recorded speech. These were spread across a hundred volunteers, each recording about two hundred and fifty lines. All 250 lines took between three and four hours to record in one studio session, depending on individual speed, fluency, and enthusiasm.
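As a rough check on these figures, the short Python sketch below works out the averages they imply. It assumes the fifty hours are spread evenly across lines and volunteers, which the project does not claim, and the variable names are illustrative only.

```python
# Figures taken from the text above.
TOTAL_LINES = 25_000
VOLUNTEERS = 100
TARGET_HOURS = 50
SESSION_HOURS = (3, 4)  # reported range for one studio session

# Assumption: lines and audio are spread evenly; real sessions will vary.
lines_per_volunteer = TOTAL_LINES // VOLUNTEERS
seconds_per_line = TARGET_HOURS * 3600 / TOTAL_LINES
audio_minutes_per_volunteer = TARGET_HOURS * 60 / VOLUNTEERS

print(lines_per_volunteer, "lines per volunteer")
print(round(seconds_per_line, 1), "seconds of audio per line on average")
print(audio_minutes_per_volunteer, "minutes of kept audio per volunteer")

for hours in SESSION_HOURS:
    share = audio_minutes_per_volunteer / (hours * 60)
    print(f"{hours}-hour session: about {share:.0%} of studio time is kept as audio")
```

Under that even-spread assumption, each line averages about seven seconds, and each volunteer contributes roughly half an hour of finished audio from a session lasting three to four hours.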
We set up a mobile recording booth, which was placed first in a private study that was insulated for noise, and later in an office at the Linguistics Department at the University of Lagos, where it remains today to benefit students and scholars interested in creating their own language recordings and experiments. Volunteers came in one at a time to record their lines, supervised by one or two members of the team. Errors were corrected on site, whether they lay in the pronunciation itself or in the lines that guided it.

Challenges and innovations

One of the things we discovered early in the project design was the absence of a usable application to elicit and record audio from volunteers. There were general-purpose recording apps like Audacity or GarageBand, but none of them was designed specifically for linguistic audio elicitation, where individual subjects can be assigned designated lines, and where all completed lines can be stored against each subject for later retrieval. Google, when I worked there, had its own in-house elicitation tool, but it has not been made public, so we knew early in the project that we would need to create one for ourselves.
One of our team members took on this task. In the early days of the work, we depended on Audacity to record the bulk of the lines, and then spent many hours afterwards cutting and saving the individual takes in a designated folder. This was too much work, and it took precious time. Iroro, the team member in question, who has software engineering skills, took feedback from the problems of the Audacity sessions and came up with a tool that solved them.

That tool, tentatively called Yorùbá Voice Speech Recorder, was ready after a few months. In it, all the lines assigned to one subject are loaded before the recording session, and as soon as each line is recorded, the next one pops up. The recorded lines are stored with a unique number and date, and can be accessed later for analysis. Unlike Audacity, the application excludes all the conversation before and after each take, recording only the necessary audio segment, which makes the work faster. We will be releasing the Yorùbá Voice Speech Recorder app along with the data from this project, as our contribution to future work in this direction.
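The workflow described above can be sketched in a few lines of Python. This is only a hypothetical illustration of the bookkeeping, not the design of the actual app: the `Session` class, its method names, and the file-naming pattern are all assumptions, and no audio is captured here.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class Session:
    """One volunteer's recording session (hypothetical structure)."""

    subject_id: str
    lines: List[str]          # prompts assigned to this subject, loaded up front
    recorded_on: date
    takes: Dict[int, str] = field(default_factory=dict)  # line number -> file name

    def next_line(self) -> Optional[Tuple[int, str]]:
        """Return the first line not yet recorded, or None when all are done."""
        for number, text in enumerate(self.lines, start=1):
            if number not in self.takes:
                return number, text
        return None

    def save_take(self, number: int) -> str:
        """Register a finished take under a name built from subject, number, and date."""
        filename = f"{self.subject_id}_{number:04d}_{self.recorded_on.isoformat()}.wav"
        self.takes[number] = filename
        return filename


if __name__ == "__main__":
    session = Session("spk001", ["first prompt", "second prompt"], date(2023, 1, 5))
    while (pending := session.next_line()) is not None:
        number, text = pending
        print("Prompt", number, ":", text, "->", session.save_take(number))
```

Keying every take by subject and line number is what removes the manual cutting and filing that the Audacity sessions required: each recording is already attached to the prompt that produced it.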

Duration and result

The project was planned to take eighteen months, but took almost two years. This was because of obstacles in creating the elicitation app, finding volunteers, setting up a studio, getting the analysis crucial to interpreting the data, and our own individual schedules. Each member of the team had other commitments and, not being directly paid for their work on this project, sometimes had to prioritize their day jobs over volunteer efforts.

The post-production of the audio data was, perhaps, the longest part of the project. Because the audio had been elicited using both Audacity and our Yorùbá Voice Speech Recorder application, the two sets of recordings needed to be harmonized, so a long period was dedicated to ensuring that each recorded line was properly labeled. Another period was then dedicated to listening to each and every one of the 25,000 lines to ensure that the recording matched the line that was supposed to guide it. What we found, as is common in any human endeavor, was that some lines didn’t quite match, whether because of a speaker mistake, an audio malfunction, an error in the line, or simple human oversight. As a result, some lines had to be edited to match the audio, some audio was edited to match the lines, and some lines were deleted outright and re-recorded at a later studio session.

This meticulous process was important to ensure that the audio we release is the best result of our effort over the two years. But it came at the expense of speed.
By the end of 2023, we had created a text-to-speech model from the audio data we elicited, along with an automatic speech recognition model that can benefit Yorùbá language users. All of this will be made available soon. We have also written and submitted a few academic papers on the work and its results.

Impact

The impact of our work, we imagine, will be the creation of a foundation on which a number of language technology applications can be built for the Yorùbá language, as well as a demonstration for other low-resource languages in Nigeria and Africa, showing the ways in which the problem can successfully be tackled. The model of our work is replicable for any and all languages on the continent. With 20,000 euros, you can create 50 hours of solid audio corpus, empower the creation of language applications that can be used by all, create experiments in text-to-speech and automatic speech recognition, and add to a growing body of knowledge about language technology in African languages.
Our work is different from others in its scope and significance. No other corpus project for an African language is as ambitious as this one, though we hope that ours will not be the last. Fifty hours is good but should not be the ceiling. There is also a lot more work to be done on code-switching data, which would benefit the many Yorùbá speakers who habitually mix in a second language. But the work we have done here will have a significant impact in propelling the Yorùbá language forward in the language technology space, and will benefit others hoping to create commercial or research products for users around the world.

Who are the team members?

Kọ́lá Túbọ̀sún

Team lead and principal applicant. My work was to conceptualize the scope and direction of the project along with others, set targets and goals, assign tasks, manage the budget, and coordinate the overall research from beginning to end.
My experience working at Google with a smaller scale version of this project was helpful in understanding the scale and scope of the work and the anticipated problems.

Iroro Orife

Audio engineer and speech researcher. He led the design of the Yorùbá Voice Speech Recorder app, our in-house audio data elicitation application. His experience with audio engineering was helpful at each stage of the data elicitation, studio work, dataset preparation, and quality control.

Tolulope Ogunremi

A Computer Science PhD student at Stanford. She was the lead researcher on the ASR and TTS phases of the project. Along with her experience in language research, her knowledge of data elicitation was helpful during the project, and she also coordinated the writing of the academic paper.

David Adelani

A research fellow at University College London. As a computer science researcher, he led the efforts to create models out of the research corpora, and helped coordinate the writing of the papers. His experience in field research for low-resource languages was also relevant to the project design.

Arem Adeola Jr

A recent graduate of linguistics at the University of Lagos, deputized to handle the elicitation of data, either as lead studio supervisor or as assistant.
His competence in Yorùbá was invaluable.
His presence on the ground, along with his enthusiasm, dedication, and expertise, made the work possible in Nigeria. Without his physical presence and attention, the work would have failed or been significantly delayed.

Although we were all based in different parts of the world, and moved around quite a bit during the project, the cooperation and coordination of each member helped bring the work to the desired conclusion.

Conclusions

The challenges facing African languages in technology are enormous, and will not be surmounted without consistent, concerted effort and funding. What Imminent has done here by supporting efforts in a positive direction, without the expectation of any rewards in the process, is one good example of how this can and should be done. We need more efforts by African organizations themselves, as well as by global institutions interested in inclusion and equity. We have over two thousand languages on the continent, but only a handful of them can even dream of surviving in the internet age because of a myriad of obstacles, some as small as orthography and others as large as the absence of large online corpora. Each of these issues can be solved, one at a time, but only with dedicated enthusiasts for the work, and the presence of large support grants that can make that work possible. We are grateful for this grant and hope to find more ways, and more people, to build on the foundation we have set here for a better future.

Kọ́lá Túbọ̀sún

Linguist, writer | Helping African languages grow on the internet through research, technology, literature, education, advocacy, translations, documentation, and lexicography.

Linguist, creative writer, teacher, and project manager with over a decade of experience in language technology, research, localisation, translation, literature, art & culture, and language advocacy for African (and Nigerian) languages.
