Ask a mainstream AI chatbot for directions in Quechua, or try to joke with it in colloquial Marathi, and something feels off. The words may come back technically correct, but the meaning doesn’t quite land. The response sounds like someone who learned the language formally and missed how it’s actually used.
That gap isn’t accidental. It reflects where today’s most widely used AI systems come from.
Large language models are overwhelmingly trained on English-language data, much of it drawn from formal writing, Western media, and standardised registers. When other languages appear, they tend to show up in their most polished forms: textbook Hindi, European Spanish, or standard French. Everyday speech, regional slang, oral traditions, and cultural reference points are far less visible.
For people outside those defaults, using AI often means translating yourself first.
That’s beginning to change, largely through regional efforts to rebuild the interface itself.
Across Latin America, a coalition of universities and researchers is working on LatamGPT, a regionally developed language model trained on Latin American data and contexts. The goal is not scale but representation: systems that understand how language is actually spoken across the region.
That matters in a place where Spanish varies sharply by country and class, and where millions speak Indigenous languages such as Guarani in Paraguay, Nahuatl in Mexico, or Mapudungun among Mapuche communities in Chile and Argentina. These languages carry grammatical structures, metaphors, and ways of reasoning that don’t map cleanly onto English.

The challenge goes beyond vocabulary.
In 2023, ChatGPT was asked to translate the Mexican idiom “me cayó el veinte.” The literal output, “the twenty fell on me,” missed the point entirely. What the phrase actually means is closer to “I finally got it” or “the penny dropped,” a reference to old payphones that only worked once a 20-cent coin clicked into place.
A model trained on dictionaries can translate the words. A model trained on lived languages understands the context.
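To make that distinction concrete, here is a minimal sketch, not drawn from any of the systems mentioned above, contrasting word-by-word glossing with phrase-level lookup. The dictionaries and the translate function are hypothetical stand-ins for what a model has or hasn't absorbed from usage.

```python
# Hypothetical word-level glosses: what a "dictionary-trained" system falls back on.
WORD_GLOSSES = {
    "me": "me",
    "cayó": "fell",
    "el": "the",
    "veinte": "twenty",
}

# Hypothetical phrase-level meanings: knowledge that only comes from how the idiom is used.
IDIOM_GLOSSES = {
    "me cayó el veinte": "the penny dropped / I finally got it",
}

def translate(phrase: str) -> str:
    """Prefer a phrase-level match; otherwise gloss word by word."""
    key = phrase.lower().strip()
    if key in IDIOM_GLOSSES:
        return IDIOM_GLOSSES[key]
    return " ".join(WORD_GLOSSES.get(word, word) for word in key.split())

print(translate("me cayó el veinte"))
# With the idiom entry: "the penny dropped / I finally got it"
# Without it, the word-by-word fallback produces the literal "me fell the twenty".
```

The sketch overstates how real models work, which rely on statistical context rather than lookup tables, but the failure mode is the same: without exposure to the phrase as people actually use it, only the literal reading is available.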
That distinction explains why regional models are gaining urgency.
India faces a parallel problem at a different scale. With 22 official languages and thousands of dialects, linguistic exclusion is built into digital systems by default. The government-backed Bhashini programme aims to create open language datasets that allow translation and speech tools to function across Indian languages. Alongside it, companies like Sarvam AI are building Indic-language models trained primarily on Indian data, rather than adapting English-first systems after the fact.
These efforts mirror earlier shifts in digital adoption. WhatsApp's success in India wasn't just about cost. It was about accommodation. Voice notes, regional scripts, and flexible keyboards allowed people to communicate without switching registers. Users didn't have to learn the platform; the platform learned them.
Building AI that works this way requires different data and different ethics.
Much of the world’s linguistic richness isn’t archived neatly online. It exists in oral histories, local television, community radio, street signs, and WhatsApp messages. Turning that into training data raises questions of consent and ownership.
Projects like Masakhane in Africa and Karya in India approach this collaboratively, paying contributors and keeping datasets open and community-owned. The work is slower and messier than scraping the web. It is also more accountable.
What’s emerging is not just a technical correction, but a shift in power.
As AI moves into healthcare, education, and public services, language stops being a cosmetic feature. It becomes the interface through which people are recognised or ignored. When systems understand only formal, standardised speech, they privilege certain users over others.
When machines begin to understand how people actually speak, they don’t just talk differently. They also listen differently.
