Translated's Research Center

Unlocking Hidden Tongues

In this investigative dive for Translation Horizons, we explore groundbreaking research that's harnessing machine translation to preserve and revive Chavacano, shedding light on the broader battle to save Creole languages worldwide.


How AI is Breathing New Life into Chavacano, Asia’s Lone Spanish Creole


In the bustling streets of Zamboanga City, Philippines, where colonial history lingers in every corner, a unique language echoes through markets and family gatherings. Chavacano, a vibrant Creole born from centuries of Spanish rule mixed with local Austronesian flavors, is fighting for survival in a digital world. With only about 104,000 households still using it daily, this “linguistic orphan”—a term coined by scholar John M. Lipski in 2001—faces the risk of fading away. But what if artificial intelligence could turn the tide?

Creole languages aren’t just quirky dialects; they’re living testaments to human resilience and cultural fusion. Emerging from intense contact between colonizers and indigenous peoples (often in plantations, ports, or forts), Creoles tend to streamline grammar while drawing most of their vocabulary from a dominant “lexifier” language such as Spanish, French, or English. Think of Haitian Creole, lexified by French, or Jamaican Patois, lexified by English. Chavacano stands out as Asia’s only Spanish-based Creole, blending a largely Spanish vocabulary with Philippine structures like verb-initial word order. Yet, with globalization pushing English and Tagalog (the basis for Filipino), Chavacano is classified as “threatened” by linguists, its use waning among the youth.

Enter a team of Filipino researchers whose work is pioneering AI-driven solutions. Based at the University of the Philippines Cebu and De La Salle University, they’ve tackled Chavacano’s digital invisibility head-on. Their recent studies, published amid growing interest in low-resource languages, reveal how multilingual neural machine translation (MNMT) — AI systems that juggle multiple languages at once — can bridge Chavacano to Spanish, English, Cebuano, Hiligaynon, and Tagalog.

The Data Hunt: Building Chavacano’s Digital Lifeline

Investigating Chavacano’s preservation starts with a stark reality: data scarcity. Most AI translation tools thrive on billions of sentences from major languages, but Creoles like Chavacano have been overlooked. “Computational studies on Chavacano are scarce due to the dearth of available corpora,” the team noted in their paper ChavacanoMT. To change that, they scraped over 767,000 parallel sentences—matching phrases across languages—from Bible translations and Jehovah’s Witness articles on jw.org. This created ChavacanoMT, a dataset that’s non-English-centric, focusing instead on Chavacano’s linguistic relatives.

Why religious texts? They’re reliably parallel, with verses aligning perfectly across tongues. The New Testament, available in Chavacano since the 1980s, provided a goldmine. But the team didn’t stop there; they preprocessed the data meticulously, cleaning punctuation and ensuring alignment. The result? A corpus with 1,044,185 samples in later iterations, hosted on platforms like Hugging Face for global access. This dataset has sparked further work, including language identification models to distinguish Chavacano from similar tongues like Spanish or Cebuano.
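The cleaning and alignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the function names, the verse-key format, and the placeholder sentences are all assumptions made for the example, and it assumes each language's text has already been collected into a dictionary keyed by verse reference.

```python
import re
import unicodedata


def clean_sentence(text: str) -> str:
    """Normalize Unicode, unify curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def align_by_verse(source: dict, target: dict) -> list:
    """Pair sentences that share a verse key.

    Verses missing from either side, or empty after cleaning, are skipped.
    """
    pairs = []
    for key in sorted(source.keys() & target.keys()):
        src = clean_sentence(source[key])
        tgt = clean_sentence(target[key])
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Placeholder strings only; these are not real verse translations.
chavacano = {
    "JHN 1:1": "  cbk   sentence one ",
    "JHN 1:2": "cbk sentence two",
    "JHN 1:3": "cbk sentence three",
}
english = {
    "JHN 1:1": "\u201cen sentence one\u201d",
    "JHN 1:2": "   ",
    "JHN 1:4": "en sentence four",
}

pairs = align_by_verse(chavacano, english)
print(pairs)
```

Keying on verse references is what makes religious texts convenient here: only the verse present and non-empty on both sides survives, so misaligned or missing passages drop out instead of polluting the corpus.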

AI in Action: Testing the Translation Waters

Diving deeper, a follow-up research paper probes a fascinating question: which “parent” languages best boost Chavacano translation? The team fine-tuned mT5, Google’s multilingual text-to-text model, on ChavacanoMT. Ablation tests, where languages were systematically removed from training, uncovered surprises. Austronesian influences like Cebuano and Hiligaynon (local substrates) improved translation quality by 5 BLEU points (BLEU scores how closely a system’s output overlaps with reference translations), far outpacing Spanish and English despite Chavacano’s 77% lexical overlap with Spanish.
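To make the idea of an ablation test concrete, here is a rough Python sketch of how such runs might be planned and compared. It assumes a leave-one-language-out design, which the article does not confirm; the function names are hypothetical, and no scores are included because none beyond the headline figures are reported here. The language codes are ISO 639-3.

```python
# Auxiliary languages named in the article, as ISO 639-3 codes.
AUXILIARY = ["spa", "eng", "ceb", "hil", "tgl"]
TARGET = "cbk"  # Chavacano


def leave_one_out(languages):
    """Yield (removed_language, remaining_languages) for each ablation run."""
    for removed in languages:
        yield removed, [lang for lang in languages if lang != removed]


def bleu_drop(full_score, ablated_scores):
    """Rank languages by how much BLEU falls when each one is removed.

    full_score: BLEU of the model trained on every auxiliary language.
    ablated_scores: maps a removed language to the BLEU of that run.
    """
    drops = {lang: full_score - score for lang, score in ablated_scores.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


runs = list(leave_one_out(AUXILIARY))
for removed, remaining in runs:
    print(f"train {TARGET} with {remaining} (without {removed})")
```

A language whose removal causes the largest drop is the one the model leans on most, which is the sense in which Cebuano and Hiligaynon are said to outpace Spanish and English.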

Why the mismatch? Structure matters more than shared words. Chavacano’s grammar aligns with Philippine languages, with simple tenses marked by particles like “ya” for past, making local tongues better “teachers” for AI. Spanish, with its complex conjugations, creates hurdles. The model achieved a 17 BLEU point jump in Chavacano-English translation over benchmarks, proving MNMT’s edge for Creoles. This echoes global trends. In CreoleVal, a recent benchmark covering 28 Creoles, researchers like Heather Lent noted similar challenges: data poverty hinders AI, but multilingual approaches help. For Chavacano, this work outperforms systems for other Philippine languages despite fewer samples, highlighting relatedness as a superpower.
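For readers wondering what a “BLEU point” measures, the following is a simplified, unsmoothed sentence-level version of the metric in Python. It is a teaching sketch under stated assumptions: published results are normally corpus-level scores computed with a standard tool such as sacreBLEU, the article does not say which setup the researchers used, and the whitespace tokenization and toy English sentences are choices made purely for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale, without smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: outputs shorter than the reference are penalized.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


score = bleu("the cat sat on the mat", "the cat sat on a mat")
print(round(score, 2))
```

Because the score is a geometric mean of 1-gram to 4-gram precision, a single wrong word in a short sentence costs far more than one “point”, which gives a feel for how large a 17-point gain at corpus level is.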

Beyond the Lab: Real-World Revival

Our investigation reveals AI’s role extends to everyday preservation. In Zamboanga, tools like AI-powered OCR (optical character recognition) are digitizing fading manuscripts. Transkribus, a platform using neural networks, deciphers handwritten Chavacano texts, making history accessible. Real-time translation apps break barriers for education and tourism, while crowdsourcing refines models with community idioms—ensuring AI captures Chavacano’s soul, not just words.

Recent developments amplify this. A blog post from AI Translations.io spotlights how these tools foster ownership among speakers, turning preservation into a collective effort. And with workshops like AmericasNLP including Indigenous languages in their shared tasks, Chavacano’s technology could inspire similar revivals.

The Bigger Picture: Creoles in a Digital Age

As we wrap our probe, one thing’s clear: AI translation isn’t just about convenience—it’s a lifeline for cultural identity. Chavacano’s story mirrors dozens of Creoles at risk, from Nigerian Pidgin to Kreol Morisien. By leveraging linguistic ties, technology can democratize preservation, making endangered voices heard in our interconnected world.

Yet challenges remain: limited sources of data may bias datasets, and broader texts are needed. Looking ahead, integrating sentiment analysis and dialect variations could refine these models further. For translators and linguists, this is a call to action—AI isn’t replacing human effort; it’s amplifying it to save what makes us uniquely human.

About the Authors

Charibeth Cheng

Associate Dean & Professor at De La Salle University

Charibeth Cheng is the Associate Dean and a professor at De La Salle University's College of Computer Studies. A local pioneer in Natural Language Processing and Machine Learning, she co-founded Senti AI and develops computational resources for Philippine languages. Passionate about digitizing local languages, she continues to drive innovation in language technology and AI.

Aileen Joan Vicente

Assistant Professor of Computer Science, University of the Philippines Cebu

Aileen Joan Vicente is an Assistant Professor of Computer Science at the University of the Philippines Cebu. With 24 years of teaching experience, she specializes in Data Science, Natural Language Processing, and Data Mining, applying machine learning to domains such as disaster management and social media analysis. She leads projects like Firecheck, which supports fire risk mitigation, and is advancing multilingual machine translation research on the Spanish-based Creole Chavacano and other low-resource Philippine languages.