Translated's Research Center

Humans and AI: the challenge of languages in Africa

Main topics and issues discussed
at the Imminent Unconference in Cape Town

Cape Town, October 24th-26th, 2023.

Thirty-six leaders in the language, translation, localization, and artificial intelligence industries came together for the second edition of the Imminent Unconference, following the first edition in London & Southampton, on the theme “What should language AI be in Africa?”

An unconference is a loosely structured conference with no program set in advance: discussions are organized around topics suggested by the participants, and the format aims to emphasize the informal exchange of information and ideas. Four major subjects were discussed. Below is a summary of the main discussions, without attribution to individual contributors.

1. Language investment choices: commercial or socially driven? 

“It is not going to work if we keep on seeing everything from the perspective of one side”

Commercial and social purposes both play an important role in language investment decisions, even though they do not always carry the same weight. Several other factors also need to be taken into account, including the empowerment of people, the role of governments (through funding and legislation), and culture.

Empowering people

Empowering people is of key importance: communities must be involved in deciding how tools and solutions could be useful for them. The first step is to allow people to be involved, which means giving them access to content that, most of the time, is available only in English. In Kenya, for example, many young entrepreneurs do not speak English, yet business information is available only in English and is therefore inaccessible to them. A project was developed there to build a chatbot that gives access to those who do not speak English. It was a socially driven initiative, but it also served as an entry point to economic development.

Governments 

Social and economic purposes can form a continuum: both can help people make decisions for themselves that empower them, and both can sustain continued investment in those people and scale it up. In this virtuous system, governments play a role through funding, legislation, or both. There is more than one way to do this. Regarding funding, governments in Europe have funded the national development of language resources, and this turned out to be effective. In some African countries, public funding turned out to be an ineffective way to reach the same goal. Legislation, for its part, can have a disruptive effect, as in Nigeria, where legislation requires that any commercial localization for Nigeria be done by Nigerians only. This changed the market, creating new jobs and empowering Nigerians.

Culture & Society

Language is the most human thing. It helps shape our cultures, and that is why the power of culture should not be underestimated. Culture can drive change and new legislation that answers the need that is created. One example is the crime of Driving Under the Influence (DUI), which was introduced only after cars came into use. But it is also evident that culture does not force change: in every African country people already speak their own native languages and there is already a demand for those languages, yet this has not led to any progress towards language diversity. And when it does, there is still the funding problem to address.

Bureaucracy, lack of resources, and the struggle to keep up with the times are just a few of the problems that need to be considered when public institutions are involved. South Africa has a mature and stable constitution that has done a lot to level the playing field in terms of language, but the constitution’s implementation is constrained by a lack of funds. The velocity gap between changes in society and the speed of Parliament, which cannot keep up with cultural and social developments, is real. To avoid wasting even more time, it becomes crucial to focus on exactly what is wanted from the government: to ask for a very specific thing that is part of an already existing policy. Moreover, national governments are the ones usually called into the discussion, but international institutions may be better placed to address linguistic and cultural problems, which cross boundaries.

Whether public or private, nationally or internationally funded, research is the essential factor in choosing which languages to invest in, whatever factor leads to it. In a capitalistic society, showing that there is monetary value in language, and that it is not just a cultural matter, can be a useful tactic for prioritizing language diversity. Companies have often targeted localization in marginalized communities because of the growing market and economy in those communities. When companies want to reach these growing communities, they create a demand for language technology for minority languages.

It must be emphasized, though, that it is not a question of commercial versus social motivations. Rather, it should be a combination of the two. It may not always follow the same process, but all the factors mentioned above, and more, contribute to nurturing a single chain. To make it work, each factor needs to be considered and the right ones need to be chosen.

2. In what language/s should we start to invest/translate first?

“New languages are being created a lot faster now.”

Are there unique criteria for choosing which language to invest in? And what are these criteria? Choosing on the basis of the number of speakers will leave minority languages behind. But the purpose that drives the choice can justify the choice itself. Commercial and social purposes come into play again. “If I’m selling to you I’ll speak your language, but if you’re selling to me then you need to speak my language.” Is this always true? The interest in selling something can lead to opposite outcomes: choosing one widely spoken language to increase overall reach, or choosing several minority languages to be sure of truly speaking to the intended target. It depends on the objective and, of course, on the content that is promoted. Speaking the languages of your target market or community brings a compelling benefit in terms of engagement and trust.

The same goes for social purposes. To connect people, prioritizing the most common language can be the best option in certain situations, while in others multilingualism can be the only way to address the issue. Lingua francas can also be considered as a starting point. But note that lingua francas like Sheng, a mix of Swahili and English that is widely spoken in urban East Africa, follow cultural changes, cross boundaries, and are changing faster than digitalization. That means that languages can rise and evolve faster than our capacity to collect data on them, codify them, and standardize them.

What can’t be questioned is the impact of multilingualism on education and on the growth of the societies where these languages are spoken. The concept of growth implies inclusivity: the point is to grow society as a whole. From that perspective, there have been various attempts in different countries and under different kinds of governments. Italy a century ago is just one example. People throughout Italy only spoke dialects, a different one per region. To increase the number of Italian speakers, people from different regions were mixed together when building the army: Sicilians went to Friuli Venezia Giulia and vice versa, and this helped build interaction and consolidate a lingua franca. In this example, a governmental decision impacted the cultural and linguistic environment.

But what other resources, factors, and tools should be taken into consideration in deciding which language to invest in first?

Interests should not be underestimated, and the initiative by the technology company Meta called “No Language Left Behind” clearly shows that. What drives them? It is not necessarily to improve the quality of the content in every language, but rather to avoid toxicity on their platforms. It seems that Meta wants to detect discriminatory and political posts in as many languages as possible, perhaps in order to address them in some way.

Compliance should not be underrated either. It is a powerful tool for achieving language diversity, because it makes companies, mainly big ones, pay for the development of languages. And big companies influence the behavior of other companies, both small and large. Airbnb’s platform is available in Maltese, even though English is the primary language in Malta. The investment in localization probably wasn’t made for explicit commercial reasons, but rather for compliance with EU regulations.

This leads back to the topic of regulation. The pharmaceutical industry is multilingual because it is highly regulated. Comprehension needs to be guaranteed: if there is one participant who speaks a particular language, then you have to translate everything for that one speaker.

But even when regulations are involved, there can still be difficulties, as full translation is not always possible. Some terminology does not exist in some languages. This makes it evident that translation can be only a compromise and does not always represent the best solution. There are topics and areas where the only option is to include multilingual terminology development.

And that’s not all. The format used for the translation process can impact the choice itself. Translation is not always possible, but transcription can be available when translation isn’t an option. Relying on different forms of translation, including video and audio translation, can lead to new answers and more languages.

English is shrinking, and that’s a fact. Hindi is increasing and there is evidence for this. But data is not always enough to make decisions, whether business or social. A qualitative evaluation of different factors is needed to determine which language (or languages) to start with. It must be kept in mind that the ranking changes with the context and objective of the choice and, most of all, that a few languages are just the beginning of a journey through and towards multilingualism.

3. AI data licensing for the long tail

“The efficiency of the product using the data is where the data value comes in”

Is there a market for licensing? How do we make it sustainable? A strategic approach is necessary if we want to develop AI that is truly representative of the future and is sustainable. The strategy also has to take into account transparency, ownership, the monetary value of data, and its free availability.

In 99% of cases, the people selling data are not the owners; they are just the collectors. Data providers are not paid for the data they have collected. That’s why there is a willingness to develop a model in which grass-roots communities own the data, get paid to share it, and get a license for that data. The data remains private, but ownership and monetary value are guaranteed (a question mark remains regarding access and transparency).

Grass-roots models are growing but are still the exception. Because it is not easy to trace data back to its source, making it available for free seems to be one way to solve the ownership issue. Data is a common good and, being the product of communities, should be owned by the community. Recording data points can be another option: recording the author of the data as well as the date and the location or jurisdiction. Ideally this should be built into the infrastructure. The point here is how to include authorship in data packets and then weigh a reward based on how valuable the data was in any given query. Although most data is private and available for a price, it has to be noted that communities are generally not well informed about the potential profit and the amount of money they could make from data.

However, does data in language, translation, and AI have a real, quantifiable monetary value? It’s difficult to determine the ratio of the value of data. Most of the time there’s almost nothing of value. The efficiency of the product using the data is where the value comes in. It’s not a ratio of 1 to a certain number. The value is built through the additional refinement of the data. It requires informed, systematic, well-engineered interactions: if you sell your data to Google, then you say “we reserve the right to improve on this data”.

And, assuming that there is a value, is the value static? Languages change over time, just like humans and cultures. So the data needs to be integrated and curated, because context also matters. But if the value presents itself in the service offering that uses the language technology built with the data, then it is about what you can build with the data, and the focus shifts to finding a sustainable way for SMEs to create data rather than buy it. This means looking for technology and models that require less data, less training, and fewer resources, and then building from there. More data does not necessarily lead to better results. Maybe a quality-over-quantity approach is another way to solve the problem and to approach licensing in a more sustainable way.

4. Frugality: less is more

“Here is where the constraint often becomes the push and the solution”

The necessity to work with fewer resources can turn into an opportunity, as you have to be very creative in addressing the problem. When entrepreneurs find a way to solve a problem with scarce resources, they open up a world of opportunities. It means companies can generate more profits and be more resilient in the face of fluctuating economic conditions.

But working with fewer resources can also be a choice dictated by the objective of the project. For instance, “big tech” is interested in broad-reaching, generic activities that may require a huge amount of resources, whereas local players in Africa may be more interested in a much narrower scope in terms of languages and use cases, an application for which fewer resources can be enough.

And what if the less-is-more approach gives rise to alternative Large Language Models? These could be “Small LLMs”, “Small LMs”, or just “LMs” that do not need such broad coverage but have a really narrow focus, which can make them more effective and also easier to host or deploy, especially in scenarios with limited space (in terms of hardware storage) or limited to no bandwidth. This would be a sustainable model that can address the scarcity not only of data resources but also of energy resources.

Frugality leads to scenarios we are not familiar with, and that’s why it represents an opportunity for local businesses as well as big tech companies.

We need to work towards a situation where local players with an understanding of local needs and the local context take the initiative, with or without the help of the “Global North”, the latter perhaps providing data collection or computing resources to get started if needed.

Following this path, African companies are working to solve local problems. They are on the verge of building something that no one else is going to compete with, because they know how to do things without using too much technology, and no one else can do it. Once they crack the code to building successful companies in their home markets, they will have an admirable competitive advantage even in the larger global market.


Photo credits: Dino Domenico Codevilla