Research
In our Conference Corner, readers will find Imminent’s take on the most important conferences in language research. Each edition highlights the most interesting talks, notable papers, and emerging trends presented at these events. Whether you’re exploring advances in linguistics, NLP, or broader language sciences, our curated summaries provide a clear and engaging snapshot of the ideas and innovations shaping the field.
NeurIPS 2025
This flagship conference represents the leading event for cutting-edge empirical and theoretical advances in Machine Learning, AI scaling, Foundation Models, Reinforcement Learning, and Multimodal Systems.
Attending NeurIPS 2025 provided resources, motivation, and strategic context to refine quality estimation models, integrate advanced architectures into data pipelines, and advance multimodal models.
Emerging Trends
At NeurIPS 2025, several emerging areas gained prominence, particularly LLM efficiency, alignment for evaluation, ethics and reliability, and reasoning capabilities. These topics reflect the field’s shift toward large-scale real-world deployment, where models must overcome issues such as unreliable reasoning, privacy risks, and high computational costs. Research focused on making LLMs more practical through efficiency techniques like deeper architectures and quantization, improving evaluation through better alignment and scrutiny of LLM-as-judge systems, and addressing trust concerns including data memorization, privacy, and the risk of output homogenization across models.
Highlights from Talks and Papers
A standout oral, “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”, challenged assumptions about RLVR, showing that reinforcement learning can improve performance but does not fundamentally extend reasoning beyond the base model’s inherent capabilities.
Scaling models remains a powerful driver of new capabilities. The Best Paper, “1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities”, demonstrated that extreme network depth (up to 1024 layers) can dramatically boost goal-conditioned self-supervised robotic performance. Meanwhile, “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)” revealed convergence patterns in LLM outputs, underscoring the need for diversity checks in MT and QA tasks.
Efficiency and trust were also key focuses. Talks on scaling data quality and privacy/legal risks in generative AI offered frameworks for high-volume data validation and strategies to mitigate privacy exposure, addressing compliance and production challenges. Posters like “LittleBit: Ultra Low-Bit Quantization via Latent Factorization” and “Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness” introduced ultra-low-bit quantization and rigorous unlearning evaluations, enabling efficient models while safeguarding sensitive information.
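As a rough illustration of how ultra-low-bit schemes save memory (this is not LittleBit's actual latent-factorization algorithm, just a generic 1-bit scheme in the same spirit), a minimal quantizer keeps only the sign of each weight plus a per-row scale:

```python
import numpy as np

def binarize(W):
    # 1-bit quantization: store only the sign of each weight, plus a
    # per-row scale (the mean absolute value) to recover magnitudes.
    scale = np.abs(W).mean(axis=1, keepdims=True)
    return np.sign(W), scale

def dequantize(signs, scale):
    # Approximate reconstruction: each row is its sign pattern times one scalar.
    return signs * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # toy weight matrix
signs, scale = binarize(W)
W_hat = dequantize(signs, scale)     # same shape, stored in ~1 bit per weight
```

The reconstruction is lossy, which is why methods like LittleBit add more structure (latent factorization) to claw back accuracy at extreme compression rates.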
Overall, the conference emphasized the dual challenge of pushing LLM capabilities while ensuring reliability, ethical safeguards, and efficient deployment, highlighting that advances in reasoning, scale, and data stewardship are tightly intertwined.
Conference Main Themes
The conference highlighted the gap between benchmark performance and real-world reliability, stressing the need for large models that are efficient, practical, and broadly usable.
Trust, safety, and responsible deployment were key concerns, alongside smarter architectures and evaluation strategies. Human-AI collaboration emerged as a focus, combining human oversight with AI scalability.
New Resources
NeurIPS 2025 highlighted key open-source resources, including the Infinity-Chat dataset for diversity-aware generation, the aforementioned LittleBit ultra-low-bit quantization toolkit for efficient edge models, and the Toloka hybrid human-AI annotation platform for scalable data curation. These tools and datasets represent some of the most impactful contributions for practical NLP and AI research.
EurIPS 2025
EurIPS 2025, the first European satellite event of the flagship NeurIPS conference, held in Copenhagen, provided a unique strategic advantage. By concentrating around 2,000 members of the European AI talent pool and featuring a selective program (38 orals, 241 posters), it offered high-density networking and access to scientific advances in AI within an EU context.
Emerging Trends
At EurIPS 2025, Sustainable AI was a central theme, focusing on efficiency, adaptability, and environmental responsibility. Emtiyaz Khan’s talk, “Adaptive Bayesian Intelligence and the Road to Sustainable AI” introduced continual learning methods that reduce the need to retrain models from scratch, while Sepp Hochreiter’s “Sustainable, Low-Energy, and Fast AI Made in Europe” presented xLSTM, an efficient alternative to Transformers that lowers energy use, speeds inference, and handles longer contexts for practical, low-energy AI deployment.
Highlights from Talks and Papers
A highlight was the talk “The Art of (Artificial) Reasoning” by Yejin Choi, who argues that generative AI can be democratized by transcending current scaling laws, which imply that extreme scaling of resources is the only path forward, through innovation in unconventional data, algorithms, and collaboration. Supporting her position, she presented “ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models”, demonstrating that careful RL can push the reasoning capabilities of small models. On the data side, she presented “Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning”, showing that performance can be scaled through synthetic data generation, and as an example of extreme collaboration she presented “OpenThoughts: Data Recipes for Reasoning Models”, a dataset created by coordinating many universities, companies, and startups.
Some interesting talks and posters, focusing on techniques directly applicable to language processes:
Inference-Time Hyper-Scaling with KV Cache Compression – boosts reasoning accuracy by compressing the Transformer’s KV cache to allow more token generation within the same memory footprint. The proposed method, Dynamic Memory Sparsification (DMS), sparsifies KV caches, achieving up to 8x compression by teaching pre-trained models which tokens can be scheduled for future eviction.
Beyond Oracle: Verifier-Supervision for Instruction Hierarchy in Reasoning and Instruction-Tuned LLMs (paper) – introduces a unified framework that improves instruction hierarchy in LLMs (e.g., system vs. user instructions) by using programmatically verifiable signals instead of costly oracle labels. The method introduces a synthesis pipeline that creates conflicting instruction pairs and executable verifiers (i.e., Python functions), ensuring dataset quality through automated unit testing and repair. RL with verifiable rewards on this dataset significantly enhances adherence to complex directives and safety robustness.
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention (paper) – introduces a parallel generation method that allows multiple LLM instances to collaborate via a shared, concurrently updated KV cache. The system enables workers to agree on a shared strategy and synchronize by seeing each other’s memory (KV entries), without requiring additional fine-tuning or recomputation. This promising approach enables effective and efficient collaboration between multiple LLMs, boosting accuracy on reasoning tasks.
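The eviction idea behind KV cache compression can be caricatured with a tiny cache that keeps only the highest-scoring entries. Note that DMS actually *learns* these eviction decisions during training; the per-token importance scores below are hypothetical placeholders:

```python
def compress_kv(cache, importance, budget):
    """Keep only the `budget` most important cache entries.

    cache: list of (key, value) pairs, one per past token
    importance: per-token scores (e.g., accumulated attention weight)
    budget: number of entries to retain after compression
    """
    ranked = sorted(range(len(cache)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])           # preserve positional order
    return [cache[i] for i in keep]

cache = [(f"k{i}", f"v{i}") for i in range(8)]          # 8 past tokens
importance = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.2, 0.6]  # made-up scores
small = compress_kv(cache, importance, budget=4)        # 2x compression here
```

Freed memory can then be spent on generating more tokens (longer reasoning traces or more parallel samples) within the same footprint, which is where the accuracy gains come from.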
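To make the idea of verifier supervision concrete, here is a minimal sketch of an executable verifier for a hypothetical instruction conflict (a system rule "always answer in uppercase" versus a user request for lowercase); the paper's pipeline synthesizes and unit-tests such verifiers automatically rather than writing them by hand:

```python
def make_verifier(system_rule):
    # Wraps a programmatic check into a 0/1 reward function, usable
    # directly as a verifiable reward signal during RL training.
    def verify(response):
        return 1.0 if system_rule(response) else 0.0
    return verify

# Hypothetical system rule: the response must be uppercase, regardless
# of what the (conflicting) user instruction asked for.
verifier = make_verifier(lambda r: r.isupper())

reward_obeys_system = verifier("OK, HERE IS THE ANSWER.")  # 1.0
reward_obeys_user = verifier("ok, here is the answer.")    # 0.0
```

Because the check is a plain function, reward computation is cheap, deterministic, and auditable, unlike oracle labels from humans or judge models.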
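The shared-cache mechanics can be mimicked in a serialized toy form (no real attention or true concurrency here) by workers that read everything written so far, including the other worker's entries, before contributing their own:

```python
# Toy stand-in for Hogwild!-style collaboration: a shared, append-only
# cache that every worker can read in full before writing to it.
shared_cache = []

def worker_step(name, step):
    visible = list(shared_cache)             # sees all entries from all workers
    shared_cache.append((name, step, len(visible)))

for step in range(3):
    for name in ("worker_a", "worker_b"):    # interleaved "parallel" workers
        worker_step(name, step)
```

In the real system the shared state is the KV cache itself, so each instance attends over the others' partial generations at essentially no extra cost.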
Conference Main Themes
Beyond the central theme of Sustainable AI, the conference maintained a strong balance across a broad range of topics. The program featured diverse research areas including Model Optimization and Representation Learning, as well as specialized fields such as Computer Vision, Diffusion Models, Graph Neural Networks, Causal Inference, and Reinforcement Learning.
New Resources
EurIPS 2025 highlighted key open-source resources, including TabArena, a living benchmark for ML on tabular data that advocates for evolving evaluation benchmarks over static ones. Although applied to tabular data, the specific challenges addressed in the paper are likely applicable to other domains as well.
EMNLP 2025
EMNLP is one of the world’s leading conferences on empirical natural language processing, with a strong focus on practical results, large-scale experimentation, and emerging applications of language technologies. Its relevance is immediate: the conference spans machine translation, multimodality, model evaluation, dataset creation, ethical considerations, and the fast-evolving ecosystem of large language models. This year it was held in Suzhou, China.
EMNLP also hosts the Workshop on Machine Translation (WMT), the premier annual workshop on machine translation. Tracking WMT findings has long informed our research directions in quality estimation. Attending EMNLP 2025 therefore offered not only inspiration but also direct insight into the future landscape of applied NLP.
Emerging Trends
One of the clearest cross-cutting themes this year was LLM efficiency. As models grow ever larger, researchers are increasingly focused on making them faster, lighter, and more accessible. Topics such as advanced KV-caching, extreme quantization, and improved distillation methods stood out. These approaches aim to reduce computation, cut memory needs, and make high-performance models usable on modest hardware — an essential development for many applied research teams.
Another emerging space is the use of LLMs in every stage of the research pipeline, from synthetic dataset creation to automated evaluation. This shift is reshaping traditional workflows and prompting new questions about how the field measures quality, novelty, and reliability.
Highlights from Talks and Papers
Among the most memorable keynotes was Heng Ji’s “No more processing. Time to discover.” Speaking from the vantage point of drug discovery and the broader “AI for Science” movement, she challenged the community to re-center research around genuine breakthroughs rather than incremental improvements. Her call for models and methods that support true scientific discovery resonated strongly throughout the conference.
From an MT perspective, Longyue Wang (Alibaba International) delivered a timely talk on the evolution of multilingual translation in the LLM era. He described the shift from fine-tuned LLMs to reasoning-driven models and now to LLM-based agents — while also highlighting the persistent gap between academic MT benchmarks and the needs of industrial-scale translation systems.
On the technical front, the paper “S1: Simple Test-time Scaling” drew considerable attention. It presents an open-source method for boosting model performance by increasing compute only at inference time — an ability previously demonstrated by proprietary systems such as OpenAI’s o1. With a modest fine-tuning set of just 1,000 questions on Qwen2.5-32B, the authors achieved competitive gains, opening the door to new experimentation strategies for many labs.
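The core trick, which the paper calls budget forcing, can be sketched with a stub generator standing in for a real decoder (`toy_generate` and the word-count budget below are illustrative assumptions, not the paper's implementation):

```python
def budget_force(generate, prompt, min_thinking_tokens):
    """Sketch of s1-style budget forcing: if the model stops reasoning
    too early, append "Wait" and resume generation, spending more
    inference-time compute on the same question."""
    trace = generate(prompt)
    while len(trace.split()) < min_thinking_tokens:
        trace = generate(trace + " Wait")   # suppress stopping, keep thinking
    return trace

# Stub "model" that appends one reasoning step per call (illustration only).
def toy_generate(text):
    return text + " step"

out = budget_force(toy_generate, "Q:", min_thinking_tokens=6)
```

The appeal is that the knob is purely at inference time: the same fine-tuned model can trade latency for accuracy without any retraining.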
Here is a list of other research that stood out during the conference:
- Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
- Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?
- Efficient Model Development through Fine-tuning Transfer
- Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
- Evaluating Language Translation Models by Playing Telephone
- Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Conference Main Themes
Beyond efficiency and LLM-centric pipelines, several research threads recurred across workshops and oral sessions.
Among the works that stood out from the sea of incremental research, the common themes involved probing LLM building blocks to hypothesize new architectural approaches (e.g., a mixture of heterogeneously sized experts rather than homogeneous ones) and interpreting LLMs’ internal representations to understand their implications for surface-level output.
New Resources
For quality estimation work, several new models offer improved capabilities, including NVIDIA’s Qwen-MQM, Google’s MetricX-25 and GemSpanEval, and an RL-based MT evaluation model presented by researchers from the University of Amsterdam. Together, these resources represent some of the most relevant and forward-looking contributions to EMNLP and WMT 2025.
