Curated monthly by our team of forward-thinking researchers, the latest in academic insights on language and technology: a deep dive into ideas on the brink of change about large language models (LLM), machine translation (MT), text-to-speech, and more.

Do you want to receive the latest academic insights right in your inbox?

Subscribe to our newsletter

July 2025

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Human reasoning transcends structured, step-by-step, linguistically framable thoughts, but rather comprises a subtle, complex train of intermingled linguistic and visual elements. When it comes to LLMs, however, most of the predominant approaches revolve around the popular Chain of Thought paradigm that involves following a series of linguistic thought units prior to arriving upon a final solution. Reflecting on the limitations of this paradigm in multimodal applications, the authors propose a multi-modal thought space: where the models “think” in terms of interleaved words and images that they generate across the reasoning trace. In modeling, the authors introduce a “Token Discrepancy Loss” to bridge the learning gap between visual and language embedding spaces. Utilizing the Chameleon-7B multimodal architecture, show the effectiveness of their approach on three spatial reasoning tasks.
Read the full paper here

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
On the lines that a human bilingual typically finds both languages sharing similar regions of the brain, the authors set out to explore the similarities with multilingual learning inside large language models. Employing gated sparse autoencoders as a mechanism to interpret the multilingual nature of intermediate activations with models, they discover the sharing of morphosyntactic concepts across typologically diverse languages. Their evidence suggests the idea that LLMs develop an internal lingua franca of abstract grammatical concepts in spite of being trained predominantly on English corpus – thus challenging the necessity of language-balanced datasets. In an interesting case study on machine translation, they show the ability to control the translation by manipulating features related to tense, gender, and numeric concepts, but without incurring a notable impact on another concept.
Read the full paper here

June 2025

WorldCuisine: Cross-Cultural Recipe Understanding with Multilingual Models
In the growing landscape of Vision-Language Models (VLMs), cultural and linguistic diversity often remains underexplored, with most benchmarks skewed toward English and Western-centric contexts. The authors address this critical gap by introducing WorldCuisine, a large-scale benchmark designed to evaluate multicultural and multilingual visual-language understanding through the lens of food — a universal yet deeply culture-specific domain. Comprising over a million text-image pairs spanning 30 languages and dialects from nine language families, WorldCuisine poses culturally grounded visual question-answering (VQA) tasks involving dish identification and its regional attribution. By testing models in both authentic and adversarial geographic contexts, the benchmark surfaces current limitations in the ability to reason in adversarial settings. Alongside the dataset, the authors also release a rich knowledge base of annotated food concepts.
Read the full paper here

REL-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance
As LLMs become integral to human-AI interactions, traditional evaluations of model uncertainty—focused on verbal or numerical calibration—fail to capture a critical aspect: how humans respond to model outputs. In REL-A.I. (pronounced “rely”), the authors propose an interaction-centered framework that directly measures human reliance on LLMs. Through controlled studies, they show that contextual factors—such as the domain of the question, the model’s typical tone of confidence, and even polite greetings like “I’m happy to help!”—can significantly alter user behavior, increasing reliance by up to 30%. Their findings thus reveal that seemingly well-calibrated models can still induce risky human behavior, underscoring the need to evaluate LLMs not just by what they say, but by how their outputs shape user decisions.
Read the full paper here

May 2025

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
Amongst the stream of research that seeks an optimal inference-time utilization of LLMs, the technique of speculative decoding entails making use of a smaller “draft” model that generates, or “speculates” a few additional future tokens that can then be parallelly evaluated by the target LLM. In contrast to most of the existing works that tend to use an autoregressive language model as the drafter, the authors challenge this paradigm and instead propose using a non-autoregressive diffusion language model. These models can inherently parallelize the draft token generation, thus allowing the size of the draft to be scaled. Leveraging their Masked Diffusion Language Model, the authors demonstrate a speed-up of up to 1.75x on classical speculative decoding approaches.
Read the full paper here

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models
The field of Symbolic Regression (SR) entails discovering the underlying math equations governing distributions of data. Classical techniques involve large iterations to search across a vast combinatorial space, but without much scientific motivation. The authors, taking cognizance of the scientific understanding and code generation abilities in general-purpose LLMs, propose a more efficient method for equation discovery. Representing equation structures as Python programs, they prompt a guided LLM to produce such equation proposals. The proposals are mathematically optimized in search of convergence – the best ones being prompted back to the LLM in an iterative approach. They demonstrate much quicker and better convergence through this LLM-guided strategy across three data domains.
Read the full paper here

April 2025

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Juxtaposition is a comedic device that, playing about a two-panel illustration, underpins complex non-linear reasoning to bring about humor and irony. For LLMs that have been setting newer benchmarks with their impressive demonstrations on problems involving deep comprehension and reasoning, also important would be gaining a simultaneous understanding of their creative appreciation capabilities. Connecting these dots, the authors propose the “YesBut” dataset designed to measure the ability of models to understand and interpret juxtapositions in comics. The dataset provides a way to evaluate four increasingly difficult capabilities in narrative understanding, starting from literal description writing to its abstraction into an appropriate title. Through their studies, they show that even for leading commercial LLMs, a subtle comical intelligence is far from attained.
Read the full paper here

Contrastive preference optimization: pushing the boundaries of LLM performance in machine translation
Most of the Machine Translation (MT) research by far has always centered on the availability of human translations in different ways, with recent successes leveraging them to finetune LLMs with few, high-quality data. This work challenges this paradigm, building upon two important developments in the field: the availability of models to be on par or better than an average human translator, and the reliability and preferability of reference-free metrics for MT. They propose the “Continuous Preference Optimization (CPO)” objective – a theoretical simplification of the DPO objective used to preference-tune LLMs – enhancing an MT model to generate a higher-quality translation among two high-quality outputs. The resulting model substantially improves the performance over its counterpart without the CPO objective and closes the performance gap of open-source models against leading commercial LLMs.
Read the full paper here

March 2025

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
To specialize a general-purpose language model on a target domain, extending the model’s pre-training through adaptation on a data mixture having domain-specific data can be a crucial step. For this stage, it is important to determine an optimal mixture ratio (of the adaptation data in the training data) under a fixed compute budget. To avoid incurring expensive searches for the practically optimal data mixture ratio, the authors devise a scaling law to predict the resultant validation loss as a function of the model size, training data size, and the mixture ratio. They also devise a way to extend the scaling law for cross-domain adaptation, which entails adaptation on one domain followed by inferencing on a different domain. They demonstrate the effectiveness of the law in predicting the validation loss trends for a model of a given size.

Read the full paper here

Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference
The strong knowledge retention capabilities of LLMs are well-known, with their associated privacy and copyright risks. The research field of “model unlearning” works on training a model to “forget” some documents from the training data, while teaching it to “retain” the rest, from which the model is expected to not retain any knowledge from the “forget” part while preserving its utility otherwise. In this work, the authors discuss the problems of degeneration and catastrophic forgetting associated with the previous methods and propose their LLM-based framework to track a target model’s learning and unlearning. They train an assistant LLM on the opposite objective (learning “forget”, forgetting “retain”) which then controls the learning loss of the target LLM to be trained. They show the effectiveness of this method on the aforementioned issues while also demonstrating more stable and efficient training.

Read the full paper here

February 2025

Convolutional Differentiable Logic Gate Networks
A continuous proliferation of research in deep learning methods has resulted in ever-increasing applications and business opportunities. Among a wave of recent works that have brought forth a dramatic rise in scale, capability, and efficiency of models, few works instead seek to redefine the fundamentals that compose these technologies. In this work, the authors demonstrate a promising paradigm of efficient deep learning methods to reduce computational and energy costs drastically while maintaining performance. Extending their previous work, they formulate CNN-like neural networks solely built using deep logic gate operations such as NAND, OR, and XOR. They tie their evaluation down to hardware efficiency, demonstrating an increase in speed-up to 1900x while maintaining the SoTA performance level on a machine vision problem.

Read the full paper here

RHO-1 – Not All Tokens Are What You Need
Diverging from the standard “causal language modeling” used to pre-train language models in which every token is affected with the same amount of loss, this work instead proposes a “selective language modeling” approach that entails identifying and removing “unimportant” tokens from the loss. The authors, showing the presence of irregularities in token-wise losses on continual pretraining of LLMs, propose this selective strategy for a more effective domain adaptation. Focusing on the mathematics domain, they first train a reference language model on a high-quality corpus. High perplexity tokens as perceived by this reference are then removed from loss calculation while training the target model. In evaluation, they show improvement on seven benchmarks of math reasoning.

Read the full paper here

January 2025

Larger language models do in-context learning differently
For customizing application-specific models, “in-context learning” in today’s LLMs makes it possible to only provide the model with a few examples as part of the input context, without any internal parameter update that was necessary earlier. In this exploratory work, the authors dig deeper into what factors make the models effective in learning from a given context. They measure how adversely can models be affected as a result of flipped labels in the context, and how much can they retain the performance if the semantic meaning of the labels is changed. Experimenting with five LLM families, they showcase the greater ability of larger models to absorb changes from a given context. Interestingly, they also demonstrate how instruction-tuning results in a greater ability to use prior knowledge while also learning an input-label mapping from context.

Read the full paper here

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
An adequate context is necessary for LLMs to solve a given problem. Notwithstanding, the size of context that can be provided is limited by the capacity of the LLM being used, and the budget – as the compute cost increases with size. The popular “Retrieval Augmented Generation (RAG)” technique involves retrieving and supplying only relevant chunks of information at the input, at some loss in the recall. In this work, the authors point out the ability of recent LLMs to directly process longer contexts of up to a million tokens resulting in a better performance, but which is costlier than RAG. Complementing the benefits of the two methods, they develop and propose the “self-route” technique, in which an LLM decides for itself whether it can solve the given problem using RAG, opting for a much longer context if not.

Read the full paper here

December 2024

Measuring Psychological Depth in Language Models
Evaluation of AI-generated stories has largely revolved around the estimation of coherence, style and diversity. Although sufficient for preliminary screening, they fail to evaluate the narrative power of content. To thus measure the psychological impact stories can have on readers, the authors define a Psychological Depth Scale (PDS), which consists of 5 components for measurement: empathy, engagement, emotion provocation, authenticity and narrative complexity. They establish the validity of this evaluation suite through a study involving consistent human-human correlations. Beyond this, they also show some promise in LLM-based evaluation of PDS, with some SoTA LLMs achieving decent levels of correlation with human judgements. They find stories generated by GPT-4 to attain notably high performance levels in the PDS evaluation suite.

Read the full paper here

How Does Quantization Affect Multilingual LLMs?
Under-served languages in NLP are referred to as being “low resource double bind”, implying the scarcity of data as well as computational resources for processing. LLMs have made it possible to achieve good performance without using a lot of data, and the technique of quantization serves to significantly reduce the cost and inference speed while serving LLMs – naturally, at a small performance cost. Prior works study such efficiency-performance tradeoffs, but primarily in English. In cognizance of this, authors conduct a broad study, comparing competent multilingual LLMs (103B, 35B, 8B parameters) on 5 quantization techniques across 4 different tasks, involving up to 10 languages. Through their multiple findings, they highlight the importance of the multilingual factor while measuring quantization tradeoffs, to not drastically undermine the low-resource performance.

Read the full paper here

November 2024

Thinking LLMs: General Instruction Following with Thought Generation
As humans, we’re inclined to think in our unique ways when we approach any given problem. In LLMs, the efforts so far have largely been around the popular “Chain-of-Thought” paradigm, where a model “thinks” step-by-step before arriving at the final solution. Although effective in solving math, such a stepwise approach has its limitations on general problems in the wild. In this work, the authors propose a way to enable a model to explore its own space of thought and work out a solution before it is given as the output. With their method, the model is optimized to provide an acceptable response, but without any explicit supervision on the thought content that goes before it, thus allowing the model to come up with its unique approach to the problem. The resultant model is shown to be more effective than its counterpart baseline on a variety of tasks spanning translation, reasoning, general knowledge and creativity to name a few.

Discover the full paper here

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Languages are represented in LLMs (and their likes) as encoded sequences of tokens. These tokens are, in effect, groups of characters created from a huge multilingual corpus. The vocabulary of such a token set is computed from the corpus to encode an average piece of text using the least possible number of tokens. However, the token units for low-resource languages often tend to be just tiny character chunks with little symbolism. In this work, the authors propose a fundamental change in the encoding, by using morphemes – the shortest meaningful units of a language. Working on a set of 99 languages, the authors first create a unified vocabulary of morphemes. Using this, they devise a byte-level encoding method to represent the morphemes as tokens. This method shows an increase in the equitability of encoded lengths, reduced latency, and lesser disparity across languages as compared to English.

Discover the full paper here

October 2024

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
In the very dynamically evolving landscape of today’s AI, we are continuously witnessing newer discoveries of the problem-solving capabilities of LLMs. So far, to enable the capability on a newer problem, the most common strategy involves leveraging a high-quality, human-supervised dataset for model training. But as obtaining such human supervision has its challenges, how far can we scale when AI models become smarter than humans on certain tasks? The authors, from their experiments on this problem, find out that is possible, or rather more effective to “scale” the learning from a weaker AI model to a stronger AI model. The primary motivation lies in generalizing from several small, already solved problems (by a weak LLM) to a difficult target problem (for a stronger LLM).

Discover the full paper here

RAFT: Adapting Language Model to Domain Specific RAG
The powerful capabilities of LLMs have invigorated the efforts to discover new frontiers for their application in different domains. Although open-source LLMs have decent capabilities for reasoning in the general domain, they struggle to replicate when it comes to reasoning in specialized areas such as legal, medical, etc. Focusing on this area of research, the authors propose combining two techniques to improve domain-specific LLM capabilities: finetuning on domain-specific texts and chain-of-thought reasoning from a provided context for generation. To increase the robustness of reasoning from incorrect contexts, the finetuning process teaches the model to reason, given a partially useful or even useless piece of context.

Discover the full paper here

September 2024

Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin – A Benchmark and Evaluation for Turkic Languages
Prevalent works in NLP use “mass multi-lingual” methods to enable the understanding of low-resource (LR) languages in models, which, however, tend to retain a significant bias towards high-resource (HR) languages. Instead, some other works often use English as a pivot to transfer the semantic understanding of a model to target LR languages. In this work, the authors propose a new approach to improve LR understanding by grouping related LR languages and using a common, linguistically related HR language as a pivot in addition to English. They apply their methods to a family of 5 Turkic LR languages – also contributing an evaluation benchmark – and use Turkish as the linguistic pivot. In experimentation, the authors compare several existing methods in literature and show the benefit of using the linguistic intermediary in each setting. Considering the structure that weaves together many of today’s relevant languages, efforts in the direction of these findings show promise for many such LR linguistic families.

Discover the full paper here

MAGE: Machine-generated Text Detection in the Wild
While there has been a proliferation of LLMs having a remarkable ability to generate fluent, meaningful texts in diverse domains, any such proliferation is much lacking in the study of methods to differentiate human written text from synthetic ones. This comprehensive work spans 7 domains, involving synthetic data from 27 LLMs for study. First, they show the inability of humans on this task, as well as that of a powerful LLM like GPT-4. They next present the capability of their methods in a series of increasingly difficult settings. Importantly, they find out that while it’s easier to differentiate when a known domain(s) and LLM agent(s) are involved, the task gets difficult with content from unknown LLM agents and more so when dealing with unknown domains of text. An LLM-paraphrased version of human-written text is shown to be the most difficult case for this problem. The methods and insights from this work show great promise in application and further research.

Discover the full paper here

August 2024

VERA: A General-Purpose Plausibility Estimation Model for Commonsense Statements
LLMs, although having a powerful capability to memorize worldly knowledge and write articulate responses, have still been found to generate outputs that defy common sense. This indicates the inherent difficulty in its reasoning. In this work, the authors propose a model tuned explicitly for the job of estimating the plausibility of a given statement. To achieve this, they first curate a dataset comprising 7.6M statements overall by repurposing 19 question-answering datasets and 2 knowledge banks that involve common sense. Finetuning a model involves training the model on 3 simultaneous objectives: (1) given a statement, tell if it’s plausible or not (2) given a group of similar statements, pick the most plausible one (3) given a random group of statements, separate the plausible and non-plausible ones. They show their best model, sized at 5B parameters to display a remarkable ability on these objectives.

Discover the full paper here

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
The field of Machine Translation (MT) evaluation constantly sees newer additions to increasingly powerful metrics: thanks to the yearly data and shared tasks from WMT (Workshop on MT). As of today, the best metrics have achieved an almost human-like judgment, and have far replaced classical metrics like BLEU. This work, however, asks and answers an important question: are these metrics equally reliable on multiple domains of data? While the WMT data largely comprises the news domain, the authors curate a dataset in the biomedical domain having 25k judgments spanning 11 language pairs and 21 MT systems. To make their point, they compare the relative gains in performance w.r.t. BLEU-like metrics and find that WMT-tuning on metrics results in a degradation in the gain on biomedical data. This work thus opens up a critical research question in this field, of maintaining the metric reliability on unseen domains of MT data.

Discover the full paper here

July 2024

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
With an ever-growing stream of newer LLMs being developed, each claimed to be better than its predecessor, this work steps in to address a critical gap: how do we evaluate LLMs in a way that’s reliable, quick, and inexpensive? An expert human judgment does fulfill the first criteria but is impractical for continuous benchmarking on the LLM landscape. The authors, through an annotated dataset for multi-turn question answering, show the merit of an LLM like GPT-4 to obtain explainable evaluations that are on the same level as a human-human agreement. From their studies, they also discuss a few limitations & biases and some possible ways of mitigating their effects. Besides, the authors also introduce another dataset based on the Chatbot Arena, an interactive crowd-sourced tool to collect human preferences, given responses from two anonymous LLM agents at a time.

Discover the full paper here

From Local to Global: A Graph RAG Approach to Query-Focused Summarization
While LLMs are getting good at answering general questions, they still need help from a domain-specific corpus as a knowledge bank to answer specialized questions. In contrast to the existing “RAG (Retrieval Augmented Generation)” approach of retaining the knowledge bank in the plain textual format, this work, through their “Graph RAG” approach, proposes to instead represent it in a structured, graph-based format. To build the graph, the authors use an all-LLM approach to weave together communities of related concepts in the domain. All such concepts or entities and the relations between them are automatically detected by LLMs from the knowledge bank. When asked a question, the communities are queried individually to get community-specific answers, all of which are then compiled together into a unified global answer. The authors show the benefit of this paradigm in increasing the comprehensiveness and diversity of the answers.

Discover the full lecture here

June 2024

PolyVoice: Language Models for Speech to Speech Translation
PolyVoice represents a leap in speech-to-speech translation by utilizing a novel language model-based approach. Unlike traditional systems that separate the task into ASR, MT, and TTS processes, PolyVoice integrates these steps into a single framework using discretized speech units. While the quality of the proposed method might not yet fully match that of cascaded systems, this novel, integrated approach has the potential to greatly enhance translation quality and reduce latency, preserving voice characteristics and style from source to translation. Furthermore, PolyVoice’s approach shows significant promise for supporting languages that are currently underrepresented or lack a written form, making it a crucial step toward more inclusive language technologies.

Discover the full paper here

MM-LLMs: Recent Advances in Multi Modal Large Language Models
In a comprehensive survey, Zhang et al. explore the burgeoning field of MultiModal Large Language Models (MM-LLMs). These models integrate diverse data types, from text to images and audio, enhancing AI’s ability to understand and generate complex multimodal content. MM-LLMs outperform traditional systems by leveraging pre-trained unimodal models resulting in more sophisticated language understanding. The paper highlights significant advancements, key design strategies, and future directions, positioning MM-LLMs as pivotal in achieving more nuanced and human-like AI interactions.

Discover the full paper here

More from Imminent

Imminent Research Grants

$100,000 to fund language technology innovators

Imminent was founded to help innovators who share the goal of making it easier for everyone living in our multilingual world to understand and be understood by all others. Each year, Imminent allocate $100,000 to fund five original research projects to explore the most advanced frontiers in the world of language services. Topics: Language economics – Linguistic data – Machine learning algorithms for translation – Human-computer interaction – The neuroscience of language.

Apply now

AI in Context

Imminent Readings

Eager to know more? Here, you will find a selection of articles from top newspapers, research publications, and leading magazines from around the world, exploring AI’s impact on language, culture, geopolitics, and economies.

Dive deeper

Imminent Research Reports

A journey through localization, technology, language, and research.

An essential resource for leaders and a powerful tool for gaining deeper knowledge and understanding of the multicultural world we live in.

Get your copy now

Curated monthly by our team of forward-thinking researchers, the latest in academic insights on language and technology: a deep dive into ideas on the brink of change about large language models (LLM), machine translation (MT), text-to-speech, and more.

More from Imminent

Imminent Research Grants

$100,000 to fund language technology innovators

AI in Context

Imminent Readings

Imminent Research Reports

A journey through localization, technology, language, and research.

Log into your account

Sign up to Imminent

Reset your password

Language is what makes us human.