Curated monthly by our team of forward-thinking researchers, here are the latest academic insights on language and technology: a deep dive into ideas on the brink of change, covering large language models (LLMs), machine translation (MT), text-to-speech, and more.
Do you want to receive the latest academic insights right in your inbox?
December 2024
Measuring Psychological Depth in Language Models
Evaluation of AI-generated stories has largely revolved around estimating coherence, style, and diversity. While such metrics suffice for preliminary screening, they fail to capture the narrative power of the content. To measure the psychological impact stories can have on readers, the authors define a Psychological Depth Scale (PDS) consisting of five components: empathy, engagement, emotion provocation, authenticity, and narrative complexity. They establish the validity of this evaluation suite through a study showing consistent human-human correlations. Beyond this, they also show promise in LLM-based evaluation of PDS, with some state-of-the-art LLMs achieving decent correlation with human judgments. They find that stories generated by GPT-4 attain notably high scores on the PDS evaluation suite.
Read the full paper here
How Does Quantization Affect Multilingual LLMs?
Under-served languages in NLP are said to face a "low-resource double bind": a scarcity of both data and of computational resources for processing. LLMs have made it possible to achieve good performance without much data, and quantization significantly reduces the cost and latency of serving LLMs – naturally, at a small performance cost. Prior work studies such efficiency-performance tradeoffs, but primarily in English. In light of this, the authors conduct a broad study comparing capable multilingual LLMs (103B, 35B, and 8B parameters) under 5 quantization techniques across 4 different tasks, involving up to 10 languages. Their findings highlight the importance of the multilingual factor when measuring quantization tradeoffs, so that low-resource performance is not drastically undermined.
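As a rough illustration of what quantization trades away (this toy round trip is illustrative only, not one of the five techniques the paper benchmarks), the sketch below maps float weights to 8-bit integer levels under a single scale factor and measures the reconstruction error:

```python
# Toy symmetric 8-bit quantization: floats -> int8 levels -> floats.
# Illustrative only; real serving stacks use per-channel scales, etc.

def quantize_int8(weights):
    """Map float weights to integer levels in [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer levels."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round trip is lossy: each weight is recovered only up to ~scale/2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The error bound depends only on the scale, which is why outlier weights (and, per the paper, under-represented languages) can suffer disproportionately.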
Read the full paper here
November 2024
Thinking LLMs: General Instruction Following with Thought Generation
As humans, we're inclined to think in our own unique ways when approaching any given problem. In LLMs, efforts so far have largely centered on the popular "Chain-of-Thought" paradigm, where a model "thinks" step by step before arriving at the final solution. Although effective for math, such a stepwise approach has its limitations on general problems in the wild. In this work, the authors propose a way to let a model explore its own space of thought and work out a solution before giving its output. With their method, the model is optimized to provide an acceptable response, but without any explicit supervision on the thought content that precedes it, thus allowing the model to devise its own unique approach to the problem. The resulting model is shown to be more effective than its baseline counterpart on a variety of tasks spanning translation, reasoning, general knowledge, and creativity, to name a few.
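The training signal can be sketched as follows: each generation carries a hidden "thought" and a visible response, and a judge scores only the response. The tags and function names below are illustrative assumptions, not the paper's code:

```python
# Sketch: preference pairs where only the response part is judged,
# leaving the thought content unsupervised (illustrative names).

THOUGHT_TAG, RESPONSE_TAG = "<thought>", "<response>"

def split_thought(generation):
    """Separate the private reasoning from the user-visible answer."""
    thought, _, response = generation.partition(RESPONSE_TAG)
    return thought.replace(THOUGHT_TAG, "").strip(), response.strip()

def preference_pair(gen_a, gen_b, judge):
    """Rank two generations by judging ONLY their responses; the thought
    content receives no direct supervision."""
    _, resp_a = split_thought(gen_a)
    _, resp_b = split_thought(gen_b)
    return (gen_a, gen_b) if judge(resp_a) >= judge(resp_b) else (gen_b, gen_a)

gen1 = "<thought>Translate word by word.<response>Bonjour le monde"
gen2 = "<thought>Recall the idiom.<response>Salut tout"
chosen, rejected = preference_pair(gen1, gen2, judge=len)  # stand-in judge
```

Because the thought never reaches the judge, the model is free to evolve whatever internal reasoning style best improves its responses.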
Discover the full paper here
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Languages are represented in LLMs (and their kin) as encoded sequences of tokens. These tokens are, in effect, groups of characters derived from a huge multilingual corpus, with the vocabulary computed so that an average piece of text is encoded in the fewest possible tokens. However, the token units for low-resource languages often end up as tiny character chunks carrying little meaning. In this work, the authors propose a fundamental change to the encoding, based on morphemes – the shortest meaningful units of a language. Working on a set of 99 languages, the authors first create a unified vocabulary of morphemes, and then devise a byte-level encoding method to represent the morphemes as tokens. This method yields more equitable encoded lengths, reduced latency, and less disparity between other languages and English.
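The intuition behind morphology-driven byte codes can be shown with a toy greedy encoder: frequent morphemes get short byte codes, so a morphologically rich word needs far fewer units than a raw character fallback. The morpheme table and one-byte codes here are invented for the demo, not taken from MYTE:

```python
# Toy morpheme-aware byte encoding (invented table, illustrative only).
# Frequent morphemes map to short byte codes; everything else falls back
# to raw UTF-8 bytes, mimicking the fairness argument of the paper.

MORPHEME_CODES = {"un": b"\x01", "break": b"\x02", "able": b"\x03"}

def encode(word):
    """Greedy longest-match over a morpheme vocabulary, byte fallback."""
    out, i = bytearray(), 0
    while i < len(word):
        for m in sorted(MORPHEME_CODES, key=len, reverse=True):
            if word.startswith(m, i):
                out += MORPHEME_CODES[m]
                i += len(m)
                break
        else:
            out += word[i].encode("utf-8")  # fall back to raw bytes
            i += 1
    return bytes(out)

compressed = encode("unbreakable")   # 3 morpheme codes instead of 11 bytes
```

A language whose words decompose into known morphemes gets short encodings; one whose words don't is forced onto the byte fallback, which is the disparity MYTE's unified morpheme vocabulary aims to reduce.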
Discover the full paper here
October 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
In today's rapidly evolving AI landscape, we continuously witness new discoveries about the problem-solving capabilities of LLMs. So far, the most common strategy for enabling a capability on a new problem has been to train the model on a high-quality, human-supervised dataset. But since obtaining such human supervision has its challenges, how far can we scale once AI models become smarter than humans on certain tasks? Through their experiments on this problem, the authors find that it is possible, and indeed more effective, to "scale" learning from a weaker AI model to a stronger one. The primary idea is to generalize from several small, already solved problems (solved by a weak LLM) to a difficult target problem (for a stronger LLM).
Discover the full paper here
RAFT: Adapting Language Model to Domain Specific RAG
The powerful capabilities of LLMs have invigorated efforts to discover new frontiers for their application in different domains. Although open-source LLMs have decent general-domain reasoning capabilities, they struggle to replicate this in specialized areas such as law and medicine. Focusing on this gap, the authors propose combining two techniques to improve domain-specific LLM capabilities: finetuning on domain-specific texts, and chain-of-thought reasoning from a context provided at generation time. To make reasoning robust to incorrect contexts, the finetuning process teaches the model to reason given a partially useful, or even useless, piece of context.
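The data recipe behind this robustness can be sketched roughly as follows: each training question is paired with a context that mixes the "oracle" document among distractors, and some fraction of examples contain no oracle at all. The function and field names are assumptions for illustration, not the paper's code:

```python
# Sketch of a RAFT-style training-example builder (illustrative names).
import random

def make_raft_example(question, oracle_doc, distractors, p_oracle=0.8, k=3,
                      rng=random):
    """Pair a question with k context documents; with probability p_oracle
    the oracle document is hidden among distractors, otherwise the context
    is entirely useless and the model must learn to say so."""
    docs = rng.sample(distractors, k)
    if rng.random() < p_oracle:
        docs[rng.randrange(k)] = oracle_doc
    return {"question": question, "context": docs}

ex = make_raft_example(
    "What enzyme replicates DNA?",
    oracle_doc="DNA polymerase synthesizes new DNA strands.",
    distractors=["Ribosomes build proteins.", "ATP stores energy.",
                 "Lipids form membranes.", "RNA carries messages."],
    rng=random.Random(0),
)
```

Training on such mixed contexts is what teaches the model to cite the useful document when present and to fall back on its own knowledge when it is not.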
Discover the full paper here
September 2024
Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin – A Benchmark and Evaluation for Turkic Languages
Prevalent works in NLP use "massively multilingual" methods to enable models to understand low-resource (LR) languages; these, however, tend to retain a significant bias towards high-resource (HR) languages. Other works instead use English as a pivot to transfer a model's semantic understanding to target LR languages. In this work, the authors propose a new approach to improving LR understanding: grouping related LR languages and using a common, linguistically related HR language as a pivot in addition to English. They apply their methods to a family of 5 Turkic LR languages – also contributing an evaluation benchmark – and use Turkish as the linguistic pivot. In their experiments, the authors compare several existing methods from the literature and show the benefit of the linguistic intermediary in each setting. Given that many of today's languages are similarly woven into families, these findings hold promise for many other LR language families.
Discover the full paper here
MAGE: Machine-generated Text Detection in the Wild
While there has been a proliferation of LLMs with a remarkable ability to generate fluent, meaningful text across diverse domains, the study of methods to differentiate human-written text from synthetic text has not kept pace. This comprehensive work spans 7 domains and synthetic data from 27 LLMs. First, the authors show that humans struggle at this task, as does a powerful LLM like GPT-4. They then present the capability of their methods in a series of increasingly difficult settings. Importantly, they find that while differentiation is easier when the domains and LLM agents involved are known, the task gets harder with content from unknown LLM agents, and harder still with unknown domains of text. An LLM-paraphrased version of human-written text proves the most difficult case. The methods and insights from this work show great promise for application and further research.
Discover the full paper here
August 2024
VERA: A General-Purpose Plausibility Estimation Model for Commonsense Statements
LLMs, despite their powerful capacity to memorize worldly knowledge and write articulate responses, have still been found to generate outputs that defy common sense, indicating an inherent difficulty in their reasoning. In this work, the authors propose a model tuned explicitly for estimating the plausibility of a given statement. To achieve this, they first curate a dataset of 7.6M statements by repurposing 19 question-answering datasets and 2 commonsense knowledge banks. The model is then finetuned on 3 simultaneous objectives: (1) given a statement, decide whether it is plausible; (2) given a group of similar statements, pick the most plausible one; (3) given a random group of statements, separate the plausible from the implausible ones. They show that their best model, at 5B parameters, displays a remarkable ability on these objectives.
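The three objectives can be written schematically as losses over a model's plausibility scores (probabilities in [0, 1]). This is an illustrative reformulation, not the paper's implementation:

```python
# Schematic losses for the three plausibility objectives (illustrative).
import math

def binary_loss(score, is_plausible):
    """(1) Per-statement: is this statement plausible? (cross-entropy)"""
    return -math.log(score if is_plausible else 1.0 - score)

def pick_best_loss(scores, best_index):
    """(2) Per-group: softmax cross-entropy for choosing the most
    plausible statement in a group of similar ones."""
    exps = [math.exp(s) for s in scores]
    return -math.log(exps[best_index] / sum(exps))

def separation_loss(pos_scores, neg_scores, margin=0.2):
    """(3) Per-batch: every plausible score should beat every
    implausible one by at least `margin`."""
    return sum(max(0.0, margin - p + n)
               for p in pos_scores for n in neg_scores)
```

Summing the three terms pushes the model to be calibrated on single statements while also ranking and separating plausible from implausible ones.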
Discover the full paper here
Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
The field of Machine Translation (MT) evaluation constantly sees ever more powerful metrics, thanks to the yearly data and shared tasks from WMT (the Workshop on Machine Translation). As of today, the best metrics achieve almost human-like judgment and have largely replaced classical metrics like BLEU. This work, however, asks and answers an important question: are these metrics equally reliable across multiple domains of data? While WMT data largely comprises the news domain, the authors curate a dataset in the biomedical domain with 25k judgments spanning 11 language pairs and 21 MT systems. To make their point, they compare the metrics' relative gains over BLEU-like baselines and find that metrics fine-tuned on WMT data lose much of that gain on biomedical data. This work thus opens up a critical research question for the field: maintaining metric reliability on unseen domains of MT data.
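The underlying protocol – correlating a metric's segment scores with human judgments, separately per domain – can be sketched as follows, with made-up scores standing in for real data:

```python
# Sketch of per-domain meta-evaluation of an MT metric (made-up data).

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-segment quality scores: a tuned metric tracks human
# judgments well on news but drifts on biomedical text.
human       = [0.20, 0.50, 0.70, 0.90]
metric_news = [0.25, 0.45, 0.75, 0.85]
metric_bio  = [0.60, 0.30, 0.80, 0.50]
```

Comparing the two correlations domain by domain is what exposes the degradation the authors report.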
Discover the full paper here
July 2024
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
With an ever-growing stream of new LLMs being developed, each claimed to be better than its predecessor, this work steps in to address a critical gap: how do we evaluate LLMs in a way that's reliable, quick, and inexpensive? Expert human judgment does fulfill the first criterion, but is impractical for continuous benchmarking across the LLM landscape. Through an annotated dataset for multi-turn question answering, the authors show the merit of an LLM like GPT-4 in producing explainable evaluations that agree with humans at the same level as humans agree with each other. From their studies, they also discuss a few limitations and biases, along with possible ways of mitigating their effects. The authors additionally introduce another dataset based on Chatbot Arena, an interactive crowd-sourced tool that collects human preferences over responses from two anonymous LLM agents at a time.
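One of the mitigations discussed for position bias – judging twice with the answers swapped and only trusting consistent verdicts – can be sketched like this. `ask_judge` stands in for a call to a strong LLM and is injected as a plain function here:

```python
# Sketch of pairwise LLM-as-a-judge with a position swap (illustrative).

PROMPT = ("Compare the two answers to the question and reply with 'A' or 'B'.\n"
          "Question: {q}\nAnswer A: {a}\nAnswer B: {b}")

def pairwise_verdict(q, ans1, ans2, ask_judge):
    """Judge twice with answers swapped; declare a tie on disagreement."""
    first = ask_judge(PROMPT.format(q=q, a=ans1, b=ans2))
    second = ask_judge(PROMPT.format(q=q, a=ans2, b=ans1))
    if first == "A" and second == "B":
        return "model1"
    if first == "B" and second == "A":
        return "model2"
    return "tie"  # positionally inconsistent verdicts are not trusted

def fake_judge(prompt):
    """A toy judge that simply prefers the longer answer, for the demo."""
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("\nAnswer B: ")[1]
    return "A" if len(a) >= len(b) else "B"

verdict = pairwise_verdict("Is the sky blue?",
                           "Yes, due to Rayleigh scattering.",
                           "Yes.", fake_judge)
```

A verdict that survives the position swap is much less likely to be an artifact of answer ordering.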
Discover the full paper here
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
While LLMs are getting good at answering general questions, they still need a domain-specific corpus as a knowledge bank to answer specialized questions. In contrast to the existing Retrieval-Augmented Generation (RAG) approach of retaining the knowledge bank in plain textual format, this work's "Graph RAG" approach represents it in a structured, graph-based format. To build the graph, the authors use an all-LLM pipeline to weave together communities of related concepts in the domain: the entities and the relations between them are automatically extracted by LLMs from the knowledge bank. When a question is asked, the communities are queried individually for community-specific answers, which are then compiled into a unified global answer. The authors show that this paradigm increases the comprehensiveness and diversity of the answers.
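The query step can be sketched as a map-reduce over community summaries; `llm` is a stand-in callable here, not a real API:

```python
# Sketch of Graph RAG's map-reduce query step over community summaries.

def graph_rag_answer(question, community_summaries, llm):
    """Map: query each community summary independently.
    Reduce: compile the non-empty partial answers into one global answer."""
    partials = [
        llm(f"Using only this community summary, answer '{question}':\n{s}")
        for s in community_summaries
    ]
    relevant = [p for p in partials if p]  # drop communities with nothing to say
    return llm(f"Combine these partial answers to '{question}':\n"
               + "\n".join(relevant))

def toy_llm(prompt):
    """A trivial stand-in LLM that 'answers' by echoing matching facts."""
    if prompt.startswith("Combine"):
        return " ".join(prompt.splitlines()[1:])
    summary = prompt.split(":\n", 1)[1]
    return summary if "solar" in summary else ""

answer = graph_rag_answer(
    "What powers the grid?",
    ["Community 1: solar farms supply daytime power.",
     "Community 2: local bakeries and cafes."],
    toy_llm,
)
```

Because every community gets its own pass, the global answer can draw on parts of the corpus a single retrieval query would miss.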
Discover the full paper here
June 2024
PolyVoice: Language Models for Speech to Speech Translation
PolyVoice represents a leap in speech-to-speech translation by utilizing a novel language model-based approach. Unlike traditional systems that separate the task into ASR, MT, and TTS processes, PolyVoice integrates these steps into a single framework using discretized speech units. While the quality of the proposed method might not yet fully match that of cascaded systems, this novel, integrated approach has the potential to greatly enhance translation quality and reduce latency, preserving voice characteristics and style from source to translation. Furthermore, PolyVoice’s approach shows significant promise for supporting languages that are currently underrepresented or lack a written form, making it a crucial step toward more inclusive language technologies.
Discover the full paper here
MM-LLMs: Recent Advances in MultiModal Large Language Models
In a comprehensive survey, Zhang et al. explore the burgeoning field of MultiModal Large Language Models (MM-LLMs). These models integrate diverse data types, from text to images and audio, enhancing AI's ability to understand and generate complex multimodal content. MM-LLMs outperform traditional systems by leveraging pre-trained unimodal models, resulting in more sophisticated language understanding. The paper highlights significant advancements, key design strategies, and future directions, positioning MM-LLMs as pivotal in achieving more nuanced and human-like AI interactions.
Discover the full paper here
More from Imminent
Imminent Research Grants
$100,000 to fund language technology innovators
Imminent was founded to help innovators who share the goal of making it easier for everyone living in our multilingual world to understand and be understood by all others. Each year, Imminent allocates $100,000 to fund five original research projects exploring the most advanced frontiers in the world of language services. Topics: Language economics – Linguistic data – Machine learning algorithms for translation – Human-computer interaction – The neuroscience of language.
Apply now
AI News for Global Citizens
Imminent Readings
Eager to know more? Here, you will find a selection of articles from top newspapers, research publications, and leading magazines from around the world, exploring AI’s impact on language, culture, geopolitics, and economies.
Dive deeper
Symbiotic Connections
Imminent’s Annual Report 2024
A journey through neuroscience, localization, technology, language, and research. An essential resource for leaders and a powerful tool for going deeper in knowing and understanding the perceived trade-off between artificial intelligence and humans and on their respective role in designing socio-technical systems.
Secure your copy now!