Translated's Research Center

Translation AI: Emerging Trends

Translated’s Deep Learning Scientist Sagar Joshi unpacks the rise of multilingual Transformer models—and their future in AI.

Research

The science of Artificial Intelligence, with its astounding evolution over the last three decades, stands as a testament to the sophisticated advancement of humankind. Likewise, the evolution of machine translation reflects a significant advancement in the capabilities of AI technology. The problem is inherently difficult: solving it requires deep understanding of, and expertise in, both the source and target languages, along with the reasoning ability to grasp the context in which a text is to be translated.

1. Historical Background

Rule-based Systems

Although the origins of automatic translation can be traced back to as early as the 9th century, an interesting turn in the history of modern machine translation occurred in 1954, when the Georgetown-IBM experiment demonstrated promising capabilities in translating from romanized Russian to English. The experiment used a “rule-based” system, translating sentences according to hardcoded grammar rules. Though initially promising, it was not followed by significant improvements over the next few decades, and interest in machine translation dwindled for a while.
The approaches developed during this period revolved around learning from “parallel texts” (i.e., texts with corresponding sentences in both the source and target languages), enabling translation by substituting words and phrases in a word-by-word or phrase-by-phrase manner. Such substitutions were based on a bank of linguistic rules, or on analogies derived from examples in a parallel text.

Statistical Models

Following several decades of research into rule-based translation systems, the 1980s saw rapid advancements in computer processor chips. The increased computational capability of CPUs paved the way for “Statistical Machine Translation” (Peter F. Brown et al. 1990, Peter F. Brown et al. 1993), which built statistical models that implicitly encode translation rules, in contrast to earlier hard-coded approaches. Rooted in information theory, these methods aim to maximize the likelihood of producing a target translation, based on word-by-word alignments learned from large parallel texts.
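
Concretely, these systems follow the noisy-channel formulation of Brown et al.: given a source sentence f, the system searches for the target sentence e that maximizes

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

where the translation model P(f | e) is estimated from word alignments in parallel text, and the language model P(e) is estimated from target-language text alone.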

The compute-intensive nature of this paradigm made it possible to mine as much as could be learned from large parallel texts without laborious human effort, and it led to many translation systems with impressive outputs and commercial applications. Moses is a popular open-source translation toolkit from this paradigm (Philipp Koehn et al. 2007).


Imminent Science Spotlight

Academic lectures for language enthusiasts

Curated monthly by our team of forward-thinking researchers, the latest in academic insights on language and technology: a deep dive into ideas on the brink of change about large language models (LLM), machine translation (MT), text-to-speech, and more.

Dive deeper

2. Into the Deep Learning Era

A New Beginning

The continuous advancement of computational systems, and especially of GPU systems with far greater processing power, ushered in “Deep Learning” systems for machine translation. In a broad sense, these models use a deep stack of parameterized “layers,” each trained to represent its output in a “latent space,” which is essentially a high-dimensional vector space that stores condensed information. With such a multilevel edifice of representations, the model ultimately builds a deep, semantically rich complex of information from the provided input. The more potent these representations are, the greater the model’s ability to generate high-quality output. Early efforts in machine translation (Holger Schwenk et al. 2006, Hai Son Le et al. 2012) used rudimentary neural network architectures to complement statistical systems and offered preliminary hypotheses for modeling machine translation with neural networks. However, statistical systems continued to dominate the machine translation landscape through this phase.

Sequential Models

Deep learning applications for machine translation gained momentum with the introduction of Recurrent Neural Networks (RNNs) as a replacement for statistical language modeling (Tomáš Mikolov. 2012, Michael Auli et al. 2013). RNNs use a “sequential” paradigm for processing text, where the representation of a sentence is built token by token, which is a more natural way to process and understand textual information. An RNN-based “encoder-decoder” architecture (Kyunghyun Cho et al. 2014) was then proposed, which also gained wide popularity in various other applications. This architectural style consists of a sequence encoder, which first condenses all the information from the source-language sentence into a single information vector, and a sequence decoder, which uses that vector to output a token-by-token translation in the target language. Modeling the architecture with Long Short-Term Memory (LSTM) units (Ilya Sutskever et al. 2014) increased its capacity for handling longer sequences.
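
As an illustration (not taken from the cited papers), here is a minimal PyTorch sketch of the encoder-decoder idea, assuming toy vocabulary sizes, a GRU in place of the original units, and greedy decoding without attention:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder condenses the source into one state; decoder emits tokens one by one."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, bos_id=1, max_len=20):
        # Encode: compress the whole source sentence into a single state vector.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode: generate target tokens one at a time, conditioned on that state.
        token = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_emb(token), state)
            token = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy pick
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq()
dummy_src = torch.randint(0, 1000, (2, 7))   # a batch of two toy "sentences"
print(model(dummy_src).shape)                # torch.Size([2, 20])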



Attention: The Transformer!

The most interesting discovery during this period was the ability to “soft-select” (Dzmitry Bahdanau et al. 2014) relevant parts of the source sequence at every step of translating, in addition to using a condensed version of the full source. The modeling choice was intuitive: in practice, only a small part of the source requires more “attention” while a text is translated token by token. This small but important addition led to significant improvements (Yonghui Wu et al. 2016), and the mechanism was adopted to tackle a wide range of deep learning problems, resulting in new state-of-the-art performances. The effectiveness of this attention mechanism led, in turn, to the introduction of the “Transformer” architecture (Ashish Vaswani et al. 2017), which revolutionized modeling approaches and continues to push performance boundaries. With this architecture, the attention mechanism became the primary modeling unit (not a mere addition!), eschewing the earlier sequential paradigm: every token in a textual sequence now computes its own representation in parallel, relying only on soft-selected parts of the sequence. The encoder-decoder framework continued with this approach, with each layer in the encoder and decoder stacks now using attention-based Transformer blocks instead of sequential RNN-like cells.
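
As a minimal sketch of the mechanism at the heart of a Transformer block, the following PyTorch snippet implements single-head scaled dot-product attention; the batch size, sequence length, and dimensions are purely illustrative, and masking and multi-head projections are omitted:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Each query "soft-selects" the values whose keys it matches most closely.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # attention distribution over the sequence
    return weights @ v

q = k = v = torch.randn(2, 10, 64)        # batch of 2, sequence of 10, dimension 64
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 10, 64])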

General-Purpose Multi-Lingual Models

It is important to note that the Transformer architecture, which forms the bedrock of today’s AI landscape, was first modeled on the machine translation problem to establish its potency in processing and representing information. Not only was the architecture found to be a great way to model complex problems, it was also easily scalable – an increase in the model size with a simultaneous increase in the data used for training could result in an increase in the model’s capabilities and performance. The model could be “pre-trained” on large amounts of data (without any parallel text or labels) to first develop a general-purpose understanding of human language and the world. The resulting pre-trained model could then be “finetuned” for specific tasks of interest. Prior research in machine translation (Çaglar Gülçehre et al. 2015, Rico Sennrich et al. 2016, Ye Qi et al. 2018) had proposed a similar “pretrain → finetune” learning paradigm to leverage large untapped monolingual corpora, but without much promise for extensibility. With the advent of Transformers, there was a proliferation of general-purpose pre-trained models — ones based only on the encoder stack (Jacob Devlin et al. 2019, Yinhan Liu et al. 2019), only on the decoder stack (Alec Radford et al. 2018, Alec Radford et al. 2019, Tom B. Brown et al. 2020), or on the complete encoder-decoder stack (Colin Raffel et al. 2019, Mike Lewis et al. 2020) of Transformer blocks.
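
As a rough illustration of the “pretrain → finetune” pattern, the sketch below assumes a recent version of the Hugging Face transformers library, an illustrative T5 checkpoint, and a one-pair toy corpus; none of these specifics come from the article:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")       # pre-trained encoder-decoder
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A tiny parallel "corpus" standing in for real fine-tuning data.
pairs = [("translate English to German: Hello, world!", "Hallo, Welt!")]

model.train()
for src, tgt in pairs:
    batch = tokenizer(src, return_tensors="pt")
    labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss   # standard sequence-to-sequence cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()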

Early attempts at leveraging general-purpose pre-trained models to improve machine translation (Rongxiang Weng et al. 2019, Jinhua Zhu et al. 2020) yielded only faint, unsteady improvements over already strong vanilla Transformer-based systems. More important was the development of “multilingual models” — general-purpose models that could understand and process multiple languages through shared parameters (Telmo Pires et al. 2019, Alexis Conneau et al. 2020). The ability of these models to transfer their learning from one language to another without explicit tuning in the target language led to important developments in “few-shot machine translation” (teaching a model to translate with very few examples) and even “unsupervised machine translation” (translating without any task-specific training).
These advancements provided great hope for long-neglected, low-resource languages (Kaitao Song et al. 2019, Yinhan Liu et al. 2020). “No Language Left Behind” (NLLB Team. 2022) is an important contribution to low-resource language translation.
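
For a sense of how such multilingual models are used in practice, the following is a hedged sketch of translation with a publicly released NLLB checkpoint via the Hugging Face transformers library; the checkpoint name and language codes are assumptions about a typical setup, not details from the article:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"              # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Knowledge should flow freely across borders.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("asm_Beng"),  # target: Assamese
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])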


Imminent Research Report 2025

A journey through localization, technology, language, and research.

The ultimate resource for understanding AI's impact on human interaction. Written by a global community of multidisciplinary experts. Designed to help navigate the innovation leap driven by automatic, affordable, and reliable translation technology.

Secure your copy now

3. Present-Day Large Language Models

Thanks to accelerating industry interest, the horizon of general-purpose multilingual Transformer models keeps expanding, pushing new frontiers every few weeks. Researchers have found it useful (Jason Wei et al. 2021, Long Ouyang et al. 2022) to use only the decoder stack of the Transformer as a language model, scaling it to billions of parameters and pre-training it on a corpus encompassing trillions of tokens. The astonishing capabilities of the resulting “Large Language Models” (LLMs) (Susan Zhang et al. 2022, Teven Le Scao et al. 2022, Aakanksha Chowdhery et al. 2023, OpenAI. 2024, Albert Q. Jiang et al. 2024, Abhimanyu Dubey et al. 2024, Xiang Yue et al. 2024) have ushered in another paradigm shift: a textual prompt-based modeling approach (Pengfei Liu et al. 2021) to solve every problem. Given a simple prompt explaining the problem, the model generates its solution, token by token! This has important implications for machine translation (Keqin Peng et al. 2023, Biao Zhang et al. 2023, Wenhao Zhu et al. 2024): we can now not only translate a given piece of text, but also explicitly condition the translation on its purpose, context, and stylistic considerations using simple descriptions in the prompt.
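
For example, such a prompt might be assembled as in the sketch below; the template is illustrative, and call_llm is a hypothetical stand-in for whatever LLM API is actually used:

def build_translation_prompt(text, target_lang, purpose, style):
    # Assemble a plain-text prompt that conditions the translation on extra context.
    return (
        f"Translate the following text into {target_lang}.\n"
        f"Purpose: {purpose}\n"
        f"Style: {style}\n\n"
        f"Text: {text}\n"
        f"Translation:"
    )

prompt = build_translation_prompt(
    text="Our new model ships next week.",
    target_lang="Italian",
    purpose="announcement to enterprise customers",
    style="formal and concise",
)
# translation = call_llm(prompt)   # hypothetical call to an LLM API
print(prompt)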

Problems, Opportunities & the Future

Research in deep learning, meanwhile, continues alongside the promising development of LLM-based applications. Active directions include “chain-of-thought” reasoning (Jason Wei et al. 2022), which instills a step-by-step approach to problem-solving in the models, and “retrieval-augmented generation” (Patrick Lewis et al. 2020), which equips the model with problem-specific knowledge in the prompt before having it generate the solution. Given the enormous size of LLMs, “parameter-efficient” methods (J. E. Hu et al. 2021) for further tuning the models for custom applications, “quantization” techniques (Benoit Jacob et al. 2017) for reducing the representational space the models take up, and other compute-efficient methods for utilizing large models (Tri Dao et al. 2022, Woosuk Kwon et al. 2023) are critical areas of study. Other important problems of today include “model unlearning” (Joel Jang et al. 2023) to prevent models from retaining privacy-sensitive information, designing comprehensive evaluation frameworks (Lianmin Zheng et al. 2023) for unbiased assessment of LLM performance, and challenging today’s fundamentals (Albert Gu et al. 2022, Felix Petersen et al. 2022) to develop even more potent modeling architectures. This list, however, hardly scratches the surface, and it keeps expanding in the rapidly evolving deep learning landscape.
Nevertheless, machine translation remains central to the dividends of present-day research. The quest to solve this problem drove many of the early advancements in AI and deep learning, and it continues to lure today’s researchers with its intricacies. In contrast to its humble beginnings in the 1950s, the problem now summons us to solve newer, much deeper complexities. Still, the ultimate goal is far away: the creation of a universal translator capable of matching or exceeding the best human translators, from every human language into any other. Only then will we have what we truly seek — a world connected through channels that transcend all linguistic and cultural barriers, where streams of knowledge and communication flow freely across borders.
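
To make the parameter-efficient idea concrete, here is a minimal sketch in the spirit of LoRA (J. E. Hu et al. 2021): a frozen linear layer augmented with a small trainable low-rank update. The rank, scaling, and dimensions are illustrative choices, not values from the paper:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a small trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)           # torch.Size([4, 512])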

Sagar Joshi

Deep Learning Scientist at Translated

Sagar Joshi works as a Deep Learning Scientist at Translated, focusing on problems in machine translation and its evaluation. Prior to this, he completed a Master’s at IIIT Hyderabad, where he published a thesis on generative methods in deep learning for legal contracts.

REFERENCES

  1. Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer and Paul S. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics.
  2. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics.
  3. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
  4. Holger Schwenk, Daniel Dechelotte, and Jean-Luc Gauvain. 2006. Continuous Space Language Models for Statistical Machine Translation. In Proceedings of the COLING/ ACL 2006 Main Conference Poster Sessions.
  5. Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  6. Tomáš Mikolov. 2012. Statistical Language Models Based on Neural Networks. Ph.D. Thesis. Brno University of Technology, Faculty of Information Technology.
  7. Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint Language and Translation Modeling with Recurrent Neural Networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
  8. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  9. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems.
  10. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.
  11. Yonghui Wu, Mike Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, et al. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv abs/1609.08144.
  12. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. 2017. Attention Is All You Need. Neural Information Processing Systems.
  13. Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Machine Translation. ArXiv abs/1503.03535.
  14. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
  15. Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  16. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.
  17. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv abs/1907.11692.
  18. Alec Radford and Karthik Narasimhan. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI.
  19. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI.
  20. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. Neural Information Processing Systems.
  21. Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of machine learning research.
  22. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  23. Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng and Weihua Luo. 2019. Acquiring Knowledge from Pre-trained Model to Neural Machine Translation. AAAI Conference on Artificial Intelligence.
  24. Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wen-gang Zhou, Houqiang Li and Tie-Yan Liu. 2020. Incorporating BERT into Neural Machine Translation. International Conference on Learning Representations.
  25. Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  26. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  27. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. International Conference on Machine Learning.
  28. Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics.
  29. NLLB Team, Marta Ruiz Costa-jussà, James Cross, Onur Celebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Philipp Koehn, et al. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation.
  30. Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai and Quoc V. Le. 2021. Finetuned Language Models Are Zero-Shot Learners. International Conference on Learning Representations.
  31. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike and Ryan J. Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. Neural Information Processing Systems.
  32. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. ArXiv abs/2205.01068.
  33. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. ArXiv abs/2211.05100.
  34. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research.
  35. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. 2023. GPT-4 Technical Report.
  36. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of Experts. ArXiv abs/2401.04088.
  37. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 Herd of Models. ArXiv abs/2407.21783.
  38. Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy and Graham Neubig. 2024. Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. ArXiv abs/2410.16153.
  39. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi and Graham Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys.
  40. Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards Making the Most of ChatGPT for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023.
  41. Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting Large Language Model for Machine Translation: A Case Study. In Proceedings of the 40th International Conference on Machine Learning (ICML '23).
  42. Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. In Findings of the Association for Computational Linguistics: NAACL 2024.
  43. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22).
  44. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20).
  45. J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations.
  46. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam and Dmitry Kalenichenko. 2017. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  47. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Neural Information Processing Systems.
  48. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23).
  49. Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. Knowledge Unlearning for Mitigating Privacy Risks in Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
  50. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23).
  51. Albert Gu, Karan Goel and Christopher Ré. 2022. Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations.
  52. Felix Petersen, Christian Borgelt, Hilde Kuehne and Oliver Deussen. 2022. Deep Differentiable Logic Gate Networks. Neural Information Processing Systems.