Sarvam AI built a translation model with Gemma 3 to translate all 22 officially recognized Indian languages
Sarvam AI is working to help build a sovereign AI ecosystem for India that caters to the country’s wide-ranging culture and diversity. Selected for the INDIAai Mission, Sarvam is developing a platform that will empower governments, enterprises, and nonprofits to use gen AI to develop agents and applications uniquely tailored to India’s scale and reality.
The organization’s recently released translation model, Sarvam-Translate, was developed with Gemma 3 to accurately translate across all 22 officially recognized Indian languages. With Gemma’s help, the new model can translate long-form and structured documents while preserving context, coherence, and cultural nuance.
The challenge
With a population of 1.4 billion, India operates at incredible scale and encompasses rich cultural diversity. Yet many of the world’s most widely used LLMs have traditionally been trained on the languages and cultures of the Western world, which means they lack a deeper understanding of India’s unique nuances. Even as multilingual LLMs improve, Sarvam found that discernible quality issues remain when translating many Indian languages, especially with mixed-format inputs that contain Markdown, HTML, scientific notation, and code.
To address translation quality at its root, Sarvam AI built Sarvam-Translate, an open-weights model that translates text across 22 Indian languages, with the ability to handle diverse formats, contexts, and styles.
Chart demonstrating Gemma 3’s lower fertility for Indian languages.
The solution
Sarvam chose Gemma 3 for Sarvam-Translate because it offered the most efficient tokenization for Indian languages among open models. This efficiency helped lower overall training and inference costs. Sarvam settled on the 4B variant of Gemma 3 for its balance of performance and cost-efficiency, allowing the team to scale with lower infrastructure costs and latency.
[Gemma] breaks down Indian language text into fewer tokens on average, which directly improves the model’s ability to represent and learn from these languages efficiently.
Pratyush Kumar, Sarvam founder
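"Fertility" here means the average number of tokens a tokenizer produces per word; lower fertility means a sentence consumes less of the context window and costs less to train on and serve. A minimal sketch of the metric, using a toy character-bigram tokenizer as a stand-in for Gemma's actual SentencePiece tokenizer (the sample text and tokenizer are illustrative only):

```python
def fertility(tokenize, text: str) -> float:
    """Average tokens per whitespace-separated word under a given tokenizer."""
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / len(words)

def bigram_tokenize(text: str) -> list[str]:
    """Toy tokenizer: split every word into character bigrams.
    Stands in for a real subword tokenizer, purely for illustration."""
    out = []
    for word in text.split():
        out.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return out

# 3 words, 9 bigram tokens -> fertility of 3.0
print(fertility(bigram_tokenize, "anuvaad sabke liye"))  # → 3.0
```

In practice one would swap `bigram_tokenize` for the candidate model's real tokenizer and compare fertility across Indian-language corpora, which is the comparison the chart above summarizes.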
Another key benefit of Gemma is its native multilingual understanding across several Indian languages, which enabled the team to reach convergence faster and shortened the fine-tuning process. The LLM was trained on domain-specific data focused on long-form translation across Indian languages with contexts up to 8K tokens. The dataset consisted of both mined and manually validated data that was carefully cleaned, covering scientific and historical material, conversational and modern text, and structurally complex formats such as code, LaTeX, HTML, and chemistry equations.
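Training on long-form documents within a fixed context window typically means splitting each document into chunks that fit the token budget while keeping paragraphs intact. A minimal sketch of such a splitter (the 8,192 budget is from the article; the whitespace-based token count and the function itself are illustrative stand-ins, not Sarvam's actual pipeline):

```python
def chunk_document(text: str, max_tokens: int = 8_192) -> list[str]:
    """Split a document into paragraph-aligned chunks, each within the
    token budget. Tokens are approximated by whitespace words here; a
    real pipeline would count with the model tokenizer instead."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# 20 paragraphs of ~1,001 words each -> three chunks under the budget.
doc = "\n\n".join(f"paragraph {i} " + "word " * 999 for i in range(20))
print([len(c.split()) for c in chunk_document(doc)])
```

Note the sketch assumes no single paragraph exceeds the budget; a production pipeline would also split oversized paragraphs at sentence boundaries.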
Example of Sarvam-Translate maintaining syntax and formatting across languages.
Starting from the Gemma3-4B-IT checkpoint, the team moved to a two-part fine-tuning stage. First, the developers fine-tuned the model on a large, diverse dataset to broaden its translation capabilities, including for languages it was not already fluent in. Second, the team applied LoRA fine-tuning on a smaller, highly curated, format-diverse dataset focused on formatting preservation and style consistency. Sarvam also applied post-training quantization and deployed with NVIDIA’s TensorRT engine, resulting in strong performance with lower compute and energy usage.
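LoRA makes that second stage cheap because it freezes the base weights and learns only a low-rank update ΔW = B·A for each targeted matrix. A rough illustration of the parameter savings (the matrix dimensions and rank below are hypothetical, not Sarvam's actual configuration):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for a full fine-tune of one d_out x d_in
    weight matrix vs. a rank-r LoRA adapter (A: r x d_in, B: d_out x r)."""
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

# Hypothetical 2560 x 2560 projection matrix with rank-16 adapters:
full, lora = lora_params(2560, 2560, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
# → full: 6,553,600  lora: 81,920  ratio: 1.250%
```

Training ~1% of the weights per adapted matrix is what lets a small, curated dataset steer style and formatting without disturbing the translation ability learned in the first stage.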
Example of Sarvam-Translate translating code and preserving its functionality.
The impact
That training produced an open model that is now deployed in real-world applications and is available on Hugging Face. The team accomplished its goal of making the model highly proficient at translating long-form and structured content across 15 of the 22 official Indian languages, while also achieving paragraph-level translation for all 22.
As a hosted translation API, the model has served over 100,000 translation requests in the last week alone, covering everything from internal tools to consumer-facing interfaces. The translation API is also a key component of Samvaad, the organization’s conversational AI agent platform, which has handled over 10 million conversation turns across Indian languages.
Since open-sourcing the model, we’ve received a lot of positive feedback from the developer community, especially around language quality and usability. Adoption has been steadily growing, and we’ve seen organic engagement from users who are integrating it into their own workflows.
Pratyush Kumar, Sarvam founder
What’s next
Going forward, Sarvam is looking to improve tokenizer support for very low-resource Indian languages like Manipuri, Santali, and Kashmiri, for greater inclusivity. While the team succeeded in making Sarvam-Translate highly efficient at translating long-form and structured content, its developers are looking to round out the model’s capabilities by improving support for colloquial and informal language.
The team is also working to deliver scalable on-premise deployments for enterprises so they can integrate the model into their workflows to keep data private and maintain compliance. “AI is becoming core infrastructure, and we believe India needs to build that infrastructure for itself. What excites us is the chance to shape technology that actually reflects how India thinks, speaks, and solves problems,” concluded Kumar.