Translated's Research Center

Imminent Research Grants – Exclusive content – EMNLP, Florida 2024

Neuroscience – 2023 Winning project

Exclusive content on the latest project funded by the Imminent Research Grants

Pauline Larrouy-Maestri

Senior researcher in the music department of the Max Planck Institute for Empirical Aesthetics

Pauline studies the relation between sounds and meaning. A classically trained pianist and a speech pathologist, she holds a PhD in cognitive sciences. She is the PI of the team and the recipient of the Imminent Award.

What makes speech sound “human”?

Synthetic speech is everywhere, and Siri's or Alexa's voices take up more and more space in our living rooms. Text-to-speech (TTS) tools aim to create intelligible and realistic voices. With TTS systems quickly getting better, are we still able to distinguish TTS voices from human ones? If so, how?

A team of researchers from the Max Planck Institute for Empirical Aesthetics combines interdisciplinary expertise to answer these questions with the support of an Imminent grant.

Are we fooled by TTS systems?
To test whether we are (still) able to tease apart human and non-human voices, we presented 75 participants with synthetic voices generated by two different TTS systems (Murf, Lovo), as well as human voices (from the RAVDESS dataset of human vocalizations), all speaking the same sentence with different emotional profiles (neutral, happy, sad, angry). Participants were asked to indicate whether they suspected that some of the voices were computer-generated and were then asked to sort the stimuli into “human” or “AI-generated”. Even though 76% of participants reported suspecting that some of the voices were computer-generated, recognition of AI voices in the categorization task was quite variable (Figure 1). Human voices were recognized as such well above chance level (85.6% correct responses), whereas synthetic voices were recognized only slightly above chance level (55.2% correct responses), which suggests that TTS systems sometimes fooled listeners into perceiving synthetic voices as human.

Figure 1. Density distribution of the proportion of correct recognition of AI-generated (orange) and human (blue) voices.

Are TTS systems equally good?
The short answer is “no”. In a recent experiment (pre-registration: osf.io/nzd9t), German participants listened to spoken German sentences either recorded from humans or generated by TTS tools (Microsoft, Elevenlabs, Murf, and Listnr). Forty listeners evaluated 1,024 speech samples in terms of how human each sample sounded to them. As illustrated in Figure 2, human voices are generally perceived to sound more human than the TTS-generated voices. In line with the previous study, the human voices all sound similarly human, but there is much larger variation among the TTS voices, indicating differences in quality. Of special note are the two voices offered by Elevenlabs, as they generally fool people into thinking they are real humans, with the median humanness rating for these voices above 50 (on a scale from 1 to 100). Interestingly, participants often disagree with each other on how human the voices sound, as highlighted by the large spread of colored individual data points in Figure 2.

Humanness rating by voice

Figure 2. Perceived humanness rating (from 1 to 100) for all 16 voices. M stands for self-reported male and F for self-reported female. Voices are ranked from high to low according to their median humanness rating, which is indicated by the thick line within each box. Colored dots indicate individual data points: blue for human voices, orange for TTS voices.

Where does (the lack of) humanness come from? 
We can now confidently say that synthetic voices contain cues that enable us to decide that they are not “human”, but the nature of these cues remains unclear. Voice quality is of course an important factor. Anyone who has heard the voice of Stephen Hawking, with its distinctive robotic timbre, can identify it as produced by a computer (or, more specifically, generated by a computer based on the voice of the scientist Dennis Klatt). More generally, the acoustic characteristics of a voice (in terms of timbre, but also pitch and loudness) are important cues that provide information about the speaker’s identity, such as age, gender, or even health. For instance, high-pitched voices are generally associated with children, and pitch typically deepens with age (and the associated hormonal changes). It is thus expected that a voice’s acoustic profile also provides information about its “humanness”.

Besides voice quality/profile, speech prosody, also called the melody of speech or intonation, is another prominent candidate for cueing the (non-)humanness of speech. Prosody plays a key role in human-to-human interactions, as well as in human-computer interactions, since it influences how we understand a message as well as how we respond to it. Unfortunately, no precise mapping between prosodic patterns and meaning has been identified so far (e.g., van Rijn & Larrouy-Maestri, 2023, in the case of emotional prosody), but we know that prosody is linked to the syntactic structure of a phrase and to the meaning of the words (Cutler et al., 1997). Indeed, the position of words in a sentence (following grammatical rules), as well as their semantic meaning, affects how we intone speech.

Focus on speech prosody
To investigate the role of speech prosody, we created various versions of phrases spoken by humans and TTS systems. Some were “normal” (i.e., regular) sentences with both syntactic and semantic information, while others conveyed syntactic but not semantic information (jabberwocky sentences), semantic but not syntactic information (wordlists), or neither syntactic nor semantic information (jabberwocky wordlists). As illustrated in Figure 3, the results from forty participants showed an effect of both syntactic and semantic information on humanness perception. Indeed, both TTS and human voices were rated higher in terms of “humanness” when speaking normal sentences than in the other stimulus conditions (all p < .01).

Humanness rating by sentence type

Figure 3. Perceived humanness rating by sentence type, for human and TTS voices separately. TTS voices are perceived as less human than human voices (replicating our previous findings), and regular sentences are perceived as more human regardless of voice type. S = regular sentences, JS = jabberwocky sentences, W = wordlists, JW = jabberwocky wordlists.

Conclusion and future plans
The findings of this project, supported by an Imminent grant, reveal that people are (still) able to discriminate between human and TTS voices, contrary to claims that synthetic voices are already indistinguishable from human ones. In other words, TTS voices do not sound exactly human yet, and it is thus crucial to figure out whether this “lack of humanness” affects how we process speech and, more generally, our communication abilities. Interestingly, our results also reveal that the quality of TTS systems (assuming that the highest quality means being able to fool listeners) varies greatly, with some TTS voices clearly recognized as synthetic and others identified as human, and that we are not all equally sensitive to humanness (as visible in the large variability across individuals). Finally, we demonstrated the role of prosody, in addition to the quality of a voice, in driving listeners’ perception of humanness. The natural next step in our research is to manipulate such factors and examine the neural responses that underlie the perception of (non-)humanness. Ultimately, our research aims to understand how we process these increasingly ubiquitous synthetic voices, to clarify how they might affect us, and possibly to alleviate negative consequences.

References

Bruder, C., & Larrouy-Maestri, P. (2023, September 06-09). Attractiveness and social appeal of synthetic voices. Paper presented at the 23rd Conference of the European Society for Cognitive Psychology, Porto, Portugal.

Cutler, A., Dahan, D., & Van Donselaar, W. (1997). Prosody in the Comprehension of Spoken Language: A Literature Review. Language and Speech, 40(2), 141–201. https://doi.org/10.1177/002383099704000203

van Rijn, P., & Larrouy-Maestri, P. (2023). Modeling individual and cross-cultural variation in the mapping of emotions to speech prosody. Nature Human Behaviour, 7, 386–396. https://doi.org/10.1038/s41562-022-01505-5

Wester, J., & Larrouy-Maestri, P. (2024, July 22). Effect of syntax and semantics on the perception of “humanness” in speech. Retrieved from osf.io/nzd9t

Team members

Neeraj Sharma (https://neerajww.github.io/), collaborator of the team, is a signal processing expert and an assistant professor at the School of Data Science and Artificial Intelligence, Indian Institute of Technology Guwahati, India.

Camila Bruder – With a mixed background in biology and music (she is also a professionally trained classical singer), Camila has recently submitted her PhD dissertation on the topic of singing voice perception.

Janniek Wester – Janniek joined the team in 2023 to complete a PhD with a focus on humanness and developed the second experiment presented here.

Acknowledgments: Pamela Breda, Melanie Wald-Fuhrmann, and T. Ata Aydin for their constructive comments during the development of this project.


Imminent Science Spotlight

Academic lectures for language enthusiasts

Imminent Science Spotlight is a new section where the next wave of language, technology, and Imminent discoveries awaits you. Curated monthly by our team of forward-thinking researchers, this is your go-to space for the latest academic insight on transformative topics like large language models (LLMs), machine translation (MT), text-to-speech, and more. Every article is a deep dive into ideas on the brink of change—handpicked for those who crave what’s next, now.

Discover more here!

Winning Projects from the Latest Edition

Human-Computer Interaction

The impact of Speech Synthesis on cognitive load, productivity, and quality during post-editing machine translation (PEMT).
Dragoș Ciobanu
University of Vienna

The increased fluency of neural machine translation (NMT) output recorded in certain language pairs and domains justifies its large-scale deployment, yet professional translators are still cautious about adopting this technology. Among their main concerns is the already-documented “NMT fluency trap” that causes translators to miss significant Accuracy errors masked by the NMT output’s high fluency.  

The Human and Artificial Intelligence in Translation (HAITrans) research group at the University of Vienna has been investigating the potential of speech technologies—synthesis and recognition—to improve the quality of professional and trainee translators’ work. This project will specifically build on experiments involving speech synthesis in the revision and post-editing processes, which show a superior level of Accuracy error detection and correction when synthesis is present. Given findings that revising with speech synthesis does not have a negative impact on revisers’ cognitive load, the researchers will use their eye-tracking lab to investigate cognitive load and productivity when post-editing with sound versus in silence. Should the PEMT findings mirror the group’s work on revision, the expectation is that translators will feel reassured that, when integrating speech synthesis into their PEMT workflows, this technology will help them avoid the NMT fluency trap without compromising productivity or increasing cognitive load.

Neuroscience of language

The neuroscience of translation. Novel and dead metaphor processing in native and second-language speakers.
Martina Ardizzi & Valentina Cuccio 
University of Parma 

Language development, production, and comprehension cannot be divorced from lived, corporeal experience, to the point that they are literally embodied (Cuccio & Gallese, 2018). A growing body of work has indeed implicated the sensorimotor system in the semantic comprehension of one’s native language. Unfortunately, the embodied character of a second language has been severely under-investigated. The present project will fill this gap by testing, in a functional magnetic resonance imaging (fMRI) study, how the comprehension of dead and novel metaphors involves the sensorimotor system in native and second-language speakers. Recent findings have shown that dead metaphors—the ones repeated ad nauseam whose literal meaning is no longer accessed—are processed outside the sensorimotor system (Yang & Shu, 2016). The expectation is that dead metaphors will recruit the sensorimotor system differently in native speakers compared to second-language speakers. The project takes a multidisciplinary approach, merging the competencies of Valentina Cuccio (PI, philosopher, Assistant Professor, University of Messina), author of recent theoretical and empirical advances in the embodied approach to language, especially in metaphor understanding, with the skills of Martina Ardizzi (co-PI, neuroscientist, Assistant Professor, University of Parma), an expert in fMRI specifically applied to the role of the sensorimotor system in cognition. The NET project will lead to two scientific papers published in open-access journals. Furthermore, the novel application of an embodied approach to translation may provide new insights into how to improve disembodied AI translations. Indeed, although there are thousands of different languages worldwide, speakers universally share the same corporeal experiences, which could ultimately ground linguistic meaning.

Machine learning algorithms for translation

Incremental parallel inference for machine translation
Andrea Santilli
Sapienza University of Rome

Machine translation works with a de facto standard neural network called the Transformer, published in 2017 by a team at Google Brain. The traditional way of producing new sentences from the Transformer is one word at a time, left to right; this is hard to speed up and parallelize. Andrea Santilli and his PhD supervisor Emanuele Rodolà, at the Sapienza University of Rome, are specialists in neural network architectures. They spotted that a similar problem is solved in image generation by using “incremental parallel processing,” a technique that refines an image progressively rather than generating it pixel by pixel, yielding speedups of 2–24×. They propose to port this method to Transformers, using clever linear algebra tricks to make it happen. At Translated, we hope that this technique and similar ones will make machine translation less expensive, and therefore accessible to a greater number of use cases and, ultimately, more people.
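
To make the intuition concrete, here is a minimal sketch of the fixed-point idea behind parallel decoding, assuming a toy next-token function in place of a real Transformer: all positions are guessed at once and refined in parallel until nothing changes, and the result matches ordinary left-to-right greedy decoding. This is an illustration of the general technique, not the authors’ implementation.

```python
# Minimal sketch of parallel (Jacobi-style) decoding with a toy "model".
import numpy as np

VOCAB = 20
rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB, VOCAB))  # toy next-token score table (stand-in for a Transformer)

def next_token(prefix):
    """Toy stand-in for argmax_y p(y | prefix): here it depends only on the last token."""
    return int(np.argmax(W[prefix[-1]]))

def autoregressive_decode(bos, length):
    """Standard greedy decoding: one sequential model call per position."""
    seq = [bos]
    for _ in range(length):
        seq.append(next_token(seq))
    return seq[1:]

def jacobi_decode(bos, length):
    """Guess the whole sequence, then refine all positions in parallel until a fixed point."""
    guess = [0] * length
    for it in range(1, length + 1):          # worst case: as many iterations as tokens
        new = [next_token([bos] + guess[:i]) for i in range(length)]  # parallelizable step
        if new == guess:                     # fixed point reached: identical to greedy output
            return new, it
        guess = new
    return guess, length

greedy = autoregressive_decode(bos=1, length=10)
parallel, iters = jacobi_decode(bos=1, length=10)
print(greedy == parallel, f"({iters} parallel iterations for 10 tokens)")
```

The speedup comes from the fact that each refinement step can evaluate all positions at once on parallel hardware, and the iteration often converges in far fewer steps than the sequence length.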

Language Data

YorùbáVoice
Kọ́lá Túbọ̀sún
Independent Researcher 

Yoruba is one of the most widely spoken languages in Africa, with 46 million first- and second-language speakers. Yet there is hardly any language technology available in Yoruba to help them, especially illiterate or visually impaired people, who would benefit most. Translated’s vision is to build a world where everyone can understand and be understood. In this project, the team will work on the “everyone,” developing speech technology in Yorùbá. The team is headed by Kọ́lá Túbọ̀sún, founder of the YorubaNames project, and includes four computer scientists and language enthusiasts with an excellent scientific track record, including publications at Interspeech, ACL, EMNLP, LREC, and ICLR. As a first step, aligned voice and text resources will be recorded professionally, at a quality usable for building text-to-speech systems. After this data is donated to the Mozilla Common Voice repository under a Creative Commons license, further speech data will be collected from volunteers online. To increase the quality of the text, the team has already developed a diacritic restoration engine.

Language Economics

T-Index
Luciano Pietronero, Andrea Zaccaria, Giordano de Marzo
Enrico Fermi Research Center Team

Understanding which countries and languages dominate online sales is a key question for any company wishing to translate its website. The goal of this research project is to complement the T-Index by developing new tools capable of identifying emerging markets and opportunities, thereby predicting which languages will become more relevant in the future for a specific product in a specific country. As a first step, the team will rely on the Economic Fitness and Complexity algorithm to determine which countries will undergo major economic expansion in the next few years. They will then leverage network science and machine learning techniques to predict the products and services that growing economies will start to import.
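
For readers curious about the algorithm, below is a minimal sketch of the standard Economic Fitness and Complexity iteration, run on an invented binary country-product export matrix. The matrix, iteration count, and normalization details are illustrative assumptions; the project would apply the method to real trade data and combine it with further network-science and machine-learning tools.

```python
# Minimal sketch of the Economic Fitness and Complexity iteration on toy data.
import numpy as np

M = np.array([          # rows: countries, cols: products (1 = competitive exporter)
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

def fitness_complexity(M, n_iter=100):
    n_c, n_p = M.shape
    F = np.ones(n_c)                      # country fitness
    Q = np.ones(n_p)                      # product complexity
    for _ in range(n_iter):
        F_new = M @ Q                     # fitness: sum of complexities of exported products
        Q_new = 1.0 / (M.T @ (1.0 / F))   # complexity: penalized when low-fitness countries export it
        F = F_new / F_new.mean()          # renormalize at every iteration
        Q = Q_new / Q_new.mean()
    return F, Q

F, Q = fitness_complexity(M)
print("country fitness:   ", np.round(F, 2))
print("product complexity:", np.round(Q, 2))
```

Countries that export many complex products end up with high fitness, which is the signal the project would track over time to anticipate economic expansion.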

Human-Computer Interaction

Humanity of Speech
Pauline Larrouy-Maestri and team
Max Planck Institute

Synthetic speech is everywhere, from our living rooms to the communication channels that connect humans all over the world. Text-to-speech (TTS) tools and AI voice generators aim to create intelligible, realistic sounds to be understood by humans. Whereas intelligibility is generally accomplished, the voices do not sound natural and lack “humanity,” which affects users’ engagement in human-computer interaction. In this project, the team aims to understand what a “human” voice is—a crucial issue in all domains relative to language, such as computer, psychological, biological, and social sciences. To do so, they will 1) investigate the timbral and prosodic features that listeners use to identify human speech, and 2) determine how “humanness” is categorized and transmitted. Concretely, the team plans to run a series of online experiments using methods from psychophysics. The research will focus both on the speech signal—through extensive acoustic analyses and manipulation of samples—and on the cognitive and social processes involved. As can be seen on her website (https://pauline-lm.github.io/), Pauline is a senior researcher at the Max Planck Institute, with a master’s in speech science, a PhD in cognitive psychology, and postdoctoral work in neuroscience. The team’s extensive expertise in the topics involved in the proposed project (see their articles on emotional prosody, auditory perception, acoustics, voice, etc., published in high-impact, peer-reviewed journals) makes them ideal candidates to successfully investigate the “humanity of speech.”

Usability of explainable error detection for post-editing neural machine translation
Gabriele Sarti
University of Groningen

Predictive uncertainty and other information extracted from MT models provide reasonable estimates of word-level translation quality. However, there is a lack of public studies investigating the impact of error detection methods on post-editing performance in real-world settings. The researchers propose to conduct a user study with professional translators for two language directions sourced from their recent DivEMT dataset. The team aims to assess whether and how error span highlights can improve post-editing productivity while preserving translation quality. The research will focus on the influence of highlight quality by comparing (un)supervised techniques with best-case estimates using gold human edits, using productivity and enjoyability metrics for evaluation. The findings will be made available in an international publication alongside a public release of code and data. This direction is relevant to Translated as a way to validate the applicability of error-detection techniques aimed at improving human-machine collaboration in translation workflows. The proposal is a reality check for research in interpretability and quality estimation, and will likely impact future research in these areas. Moreover, positive outcomes could drive innovation in post-editing practices for the industry. The project will be led by Gabriele Sarti and Arianna Bisazza, as part of Sarti’s InDeep PhD project on user-centric interpretability for machine translation. The advisory team will include Ana Guerberof-Arenas and Malvina Nissim, internationally recognized researchers in translation studies and natural language processing.

Machine learning algorithms for machine translation

Open-sourcing a recent text-to-speech paper
Phillip Wang
Independent Researcher

Open-source implementations of scientific papers are one of the essential means by which progress in deep learning is achieved today. Corporate players have stopped open-sourcing recent text-to-speech (TTS) model architectures, let alone trained models. Instead, they tend to publish a scientific paper, sometimes with details in supplementary material, and an accompanying demo with pre-generated audio snippets. The proposal is to improve this situation by implementing a recent TTS paper such as Voicebox and open-sourcing the architecture. In addition, as far as possible, the project would collect training data, train the model, and demonstrate that the open-sourced architecture performs well, for example by illustrating notable features or approximately reproducing some reported results (e.g., CMOS).

Neuroscience of language

Tracking interhemispheric interactions and neural plasticity between frontal areas in the bilingual brain
Simone Battaglia
Alma Mater Studiorum – University of Bologna

Which human brain network supports excellence in simultaneous spoken-language interpretation? Although there is still no clear answer to this question, recent research in neuroscience suggests that the dorsolateral prefrontal cortex (dlPFC) is consistently involved in bilingual language use and cognitive control, including working memory (WM), which, in turn, is particularly important for simultaneous interpretation and translation. Importantly, preliminary evidence has shown that functional connectivity between prefrontal regions correlates with efficient processing of a second language. The present project aims to characterize the spatio-temporal features of interhemispheric interactions between the left and right dlPFC in healthy bilingual adults, divided into two groups: professional simultaneous interpreters and non-expert bilingual individuals. In these two groups, the researchers will use cutting-edge neurophysiological methods for testing the dynamics of cortico-cortical connectivity, namely TMS-EEG co-registration, focusing on bilateral dlPFC connectivity. The procedure will make it possible to non-invasively stimulate the dlPFC and track signal propagation, in order to characterize the link between different aspects of language processing, executive functions, and bilateral dlPFC connectivity. The team of neuroscientists and linguists will provide novel insights into the neural mechanisms of interhemispheric communication in the bilingual brain and characterize the pattern of connectivity associated with proficient simultaneous interpretation.

Human-Computer Interaction

How can MT and PE help literature cross borders and reach wider audiences: A Case Study
Vilelmini Sosoni
Ionian University

Researchers have studied the usability and creativity of machine translation (MT) in literary texts, focusing on translators’ perceptions and readers’ responses. But what do authors think? Is post-editing of MT output an answer to having more literature translated, especially from lesser-used languages into dominant languages? The study seeks to answer this question by focusing on the book Tango in Blue Nights (2024), a flash story collection about love written by Vassilis Manoussakis, a Greek author, researcher, and translator. The book is translated from Greek into English using Translated’s ModernMT system and is then post-edited by second-year Modern Greek students at Boston University who are native English speakers and have near-native proficiency in Greek. They follow detailed post-editing (PE) guidelines developed for literary texts by the researchers. The author then analyzes the post-edited version to establish whether it is fit for publication and how it can be improved, and a stylometric analysis is conducted. The study is the first of its kind and aims to showcase the importance of MT for the dissemination of literature written in lesser-used languages, as well as to provide a post-editing protocol for the translation of literary texts.

Neuroscience of Language

Realtime Multilingual Translation from Brain Dynamics
Weihao Xia
University College London

This project, Realtime Multilingual Translation from Brain Dynamics, aims to convert brain waves into multiple natural languages. The goal is to develop a novel brain-computer interface capable of open-vocabulary electroencephalographic (EEG)-to-multilingual translation, facilitating seamless communication. The idea is to align EEG signals with pre-aligned embedding vectors from multilingual Large Language Models (LLMs). The languages are aligned in a shared vector space, allowing the model to be trained with a text corpus in only one language. EEG signals are real-time and non-invasive but exhibit significant individual variance; the challenges lie in the EEG–language alignment and in cross-user generalization. The learned brain representations are then decoded into the desired language using LLMs such as BLOOM, which produce coherent text almost indistinguishable from text written by humans. Currently, the primary application targets individuals who are unable to speak or type. However, in the future, as brain signals increasingly serve as a control channel for electronic devices, the potential applications will expand to encompass a broader range of scenarios.
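
As a rough illustration of the alignment idea, the sketch below maps synthetic “EEG” feature vectors into a shared sentence-embedding space with a ridge-regression projection and decodes by nearest neighbour under cosine similarity. All data, dimensions, and the linear map are invented for illustration; the actual project would align recorded EEG with embeddings from a multilingual LLM using far richer models.

```python
# Minimal sketch: align toy EEG features to a shared sentence-embedding space.
import numpy as np

rng = np.random.default_rng(0)
n_train, d_eeg, d_text, n_sents = 200, 64, 32, 50

# Toy "sentence embeddings" (stand-in for a multilingual LLM embedding space).
sent_emb = rng.normal(size=(n_sents, d_text))
sent_emb /= np.linalg.norm(sent_emb, axis=1, keepdims=True)

# Toy EEG features: an unknown linear mixing of the sentence embedding plus noise.
true_mix = rng.normal(size=(d_text, d_eeg))
labels = rng.integers(0, n_sents, size=n_train)
eeg = sent_emb[labels] @ true_mix + 0.1 * rng.normal(size=(n_train, d_eeg))

# Learn a projection EEG -> embedding space with ridge regression.
lam = 1.0
A = eeg.T @ eeg + lam * np.eye(d_eeg)
B = eeg.T @ sent_emb[labels]
W = np.linalg.solve(A, B)

# Decode held-out EEG by nearest neighbour among candidate sentence embeddings.
test_labels = rng.integers(0, n_sents, size=50)
test_eeg = sent_emb[test_labels] @ true_mix + 0.1 * rng.normal(size=(50, d_eeg))
pred = test_eeg @ W
pred /= np.linalg.norm(pred, axis=1, keepdims=True)
decoded = np.argmax(pred @ sent_emb.T, axis=1)   # cosine similarity
print("toy decoding accuracy:", (decoded == test_labels).mean())
```

Because the candidate embeddings live in a multilingual space, the same decoded representation can in principle be matched against sentences in any supported language.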

Machine learning algorithms for translation

Language Models Are More Than Classifiers: Rethinking Interpretability in the Presence of Intrinsic Uncertainty
Julius Cheng
University of Cambridge

Language translation is an intrinsically ambiguous task, where one sentence has many possible translations. This fact, combined with the practice of training neural language models (LMs) with large bitext corpora, leads to the well-documented phenomenon that these models allocate probability mass to many semantically similar yet lexically diverse sentences. Consequently, decoding objectives like minimum Bayes risk (MBR), which aggregate information across the entire output distribution, produce higher-quality outputs than beam search. Research on interpretability and explainability for natural language generation (NLG) has to date almost exclusively focused on generating explanations for a single prediction, yet LMs have many plausible high probability predictions. Julius’s team proposes to adapt interpretability to this context by asking the question, “do similar predictions have similar explanations?” They will answer this by comparing explanations generated by interpretability methods such as attention-based interpretability, layerwise relevance propagation, and gradient-based attribution across predictions. The goal of this project is to advance research in interpretability for NLG, deepen our understanding of the generalization capabilities of LMs, and develop new methods for MBR decoding.
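
For context, the sketch below shows minimum Bayes risk decoding in its simplest form: among sampled candidates, pick the one with the highest average utility against the other samples. The word-overlap utility and the toy samples are stand-ins; real systems would use a learned metric such as COMET and candidates drawn from a trained MT model.

```python
# Minimal sketch of sampling-based minimum Bayes risk (MBR) decoding.
from collections import Counter

def utility(hyp: str, ref: str) -> float:
    """Toy utility: unigram F1 overlap between a hypothesis and a pseudo-reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates, samples):
    """Pick the candidate with the highest expected utility against model samples."""
    scores = {c: sum(utility(c, s) for s in samples) / len(samples) for c in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical samples from an MT model for one source sentence.
samples = [
    "the cat sat on the mat",
    "the cat is sitting on the mat",
    "a cat sat on the mat",
    "the dog sat on the mat",
]
best, scores = mbr_decode(candidates=samples, samples=samples)
print(best)   # the candidate closest "on average" to the rest of the distribution
```

The contrast with beam search is the key point of the proposal: MBR uses many plausible predictions at once, which is exactly the setting in which per-prediction explanations need to be compared.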

Language Data

Curvature-based Machine Translation Dataset Curation
Michalis Korakakis
University of Cambridge

Despite recent advances in neural machine translation, data quality continues to play a crucial role in model performance, robustness, and fairness. However, current approaches to curating machine translation datasets rely on domain-specific heuristics and assume that datasets contain only one specific type of problematic instance, such as noise. Consequently, these methods fail to systematically analyze how various types of training instances—such as noisy, atypical, and underrepresented instances—affect model behavior. To address this, Michalis’s team proposes a data curation method that identifies different types of training instances within a dataset by examining the curvature of the loss landscape around an instance—i.e., the magnitude of the eigenvalues of the Hessian of the loss with respect to that instance. Unlike previous approaches, the proposed method offers a comprehensive framework that provides insights into machine translation datasets independent of model architecture and weight initialisation. It is also applicable to any language pair, as well as to monolingual generation tasks such as text summarisation.
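
To give a flavor of the proposed curvature signal, the sketch below estimates the magnitude of the dominant Hessian eigenvalue of a per-instance loss via power iteration on Hessian-vector products. The tiny linear classifier is a stand-in for an NMT model, and the exact curvature score used in the project may differ.

```python
# Minimal sketch: per-instance loss curvature via power iteration on Hessian-vector products.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 5)              # toy stand-in for a translation model
loss_fn = torch.nn.CrossEntropyLoss()
params = [p for p in model.parameters() if p.requires_grad]

def top_hessian_eigenvalue(x, y, n_iter=20):
    """Estimate |lambda_max| of the Hessian of the loss for one training instance."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        # Hessian-vector product: differentiate (grad . v) with respect to the parameters.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]  # re-normalize for the next iteration
        eig = norm.item()                     # converges to the dominant eigenvalue magnitude
    return eig

x = torch.randn(1, 10)
y = torch.tensor([2])
print("curvature score (top |eigenvalue| estimate):", top_hessian_eigenvalue(x, y))
```

Instances with unusually high or low curvature scores would then be flagged as candidates for the noisy, atypical, or underrepresented categories the proposal describes.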

Language Economics

Development of a Multilingual Machine Translator for Philippine Languages
Charibeth Cheng
De La Salle University

The Philippines is an archipelagic country consisting of more than 7,000 islands, which has contributed to its vast linguistic diversity. The country is home to 175 living indigenous languages, with Filipino designated as the national language. Within formal education, 28 indigenous languages serve as mediums of instruction, alongside English, which holds official status in business, government, and academia. This diverse linguistic landscape underscores the need for effective communication bridges: multilingual machine translation systems serve as vital links between speakers of different languages, fostering cultural inclusivity and bolstering educational and socioeconomic progress nationwide. The project aims to develop a multilingual machine translation system capable of translating text across at least seven Philippine languages, aligning with efforts to standardize and preserve indigenous languages.

Specifically, this project will focus on the following:

1. Collect and curate linguistic data sets in collaboration with linguistic experts and native speakers to ensure the accuracy and reliability of the translation system.

2. Implement machine-learning algorithms and natural language-processing techniques to train the translation model, considering the low-resource nature of Philippine languages.

3. Evaluate the efficacy of the developed translation system using standardized metrics and human evaluation.


The 2025 call is open. Submit your project.

Imminent was founded to help innovators who share the goal of making it easier for everyone living in our multilingual world to understand and be understood by everyone else. Imminent builds a bridge between the world of research and the corporate world by supporting research through scientific publications, interviews, and annual grants, funding groundbreaking projects in the language industry.

With the Imminent Research Grants, each year Imminent allocates $100,000 to fund five original research projects with grants of $20,000 each, exploring the most advanced frontiers in the world of language services. Imminent expects the call to appeal to startup founders, researchers, innovators, authors, university labs, organizations, and companies.

The 2025 call is open.