With the help of Google Cloud’s Vertex Training Clusters (VTC) infrastructure and model training science support, AI Singapore developed a state-of-the-art SEA language model based on Gemma 3 27B.
In late 2025, AI Singapore (AISG) released SEA-LION v4, the next chapter for their flagship series of language models built for the languages and cultures of Southeast Asia (SEA).
For this new version, the team chose Gemma 3 27B as its foundation. This model architecture provided a powerful starting point with multimodal processing, a long context window, broad multilingual coverage, and native function calling.
Adapting a 27-billion-parameter model to deeply understand cultural context while retaining these advanced capabilities, and doing so on a tight schedule, required a specialized training environment. The AISG team turned to Google Cloud’s Vertex Training Clusters (VTC) to simplify and accelerate large-scale model development and customization. By integrating training optimizations into the setup, the team boosted throughput by up to 23% and training step speed by almost 2x.
SEA-LION models, including the previous version based on Gemma 2, are designed to enable real-world use. The latest release, based on Gemma 3, is the #1 model for Tamil and Filipino and ranks #4 out of 55 models on the SEA-HELM leaderboard, outperforming much larger systems while running on a laptop with 32GB of RAM.
To ensure timely and cost-effective development, the team needed to optimize the entire training pipeline for efficient resource utilization. This article covers that process, from model and tokenizer selection through data curation, training, alignment, and evaluation.
VTC provided the AI Singapore team with the necessary data science tooling, optimized training recipes, and a resilient Slurm-based infrastructure for distributed training.
Creating a specialized language model begins with two foundational choices: the base model and its tokenizer. These decisions have a cascading effect on the entire development lifecycle, from data preparation to training efficiency and final performance.
AISG needed a model with a strong, pre-existing foundation in both multilingual understanding and multimodality, leading them to choose the Gemma 3 27B IT model. The Gemma 3 family stood out in many benchmarks and offered native text-and-image processing out of the box. The 27-billion-parameter size offered the best of both worlds: It was large enough to capture complex patterns in data but remained within a manageable scale for efficient training and deployment.
In their assessments, the team found that the Gemma 3 tokenizer delivered robust performance on SEA language datasets and required no modifications.
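One common way to run such a tokenizer assessment is to measure fertility, the average number of tokens produced per word, across the target languages. The sketch below illustrates the idea using the Hugging Face transformers library and the public google/gemma-3-27b-it checkpoint (access to the gated checkpoint is assumed); the sample sentences are placeholders, not AISG's evaluation data.

```python
# Minimal sketch: estimate tokenizer "fertility" (tokens per whitespace word)
# for a few SEA languages. Lower fertility generally means the tokenizer
# represents the language more compactly. Sample texts are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")

samples = {
    "Indonesian": "Cuaca hari ini sangat cerah dan cocok untuk berjalan-jalan di taman.",
    "Vietnamese": "Hôm nay trời đẹp, rất thích hợp để đi dạo trong công viên.",
    "Thai": "วันนี้อากาศดีมาก เหมาะกับการเดินเล่นในสวนสาธารณะ",
    "Tamil": "இன்று வானிலை மிகவும் நன்றாக உள்ளது, பூங்காவில் நடக்க ஏற்றது.",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    n_words = max(len(text.split()), 1)  # crude word count; Thai text has no spaces
    print(f"{lang:<11} tokens={n_tokens:3d}  fertility={n_tokens / n_words:.2f}")
```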
The team’s approach to data selection was split between curating a broad knowledge base for continued pre-training and crafting precise, skill-oriented data for post-training.
Enriching the model’s core knowledge, particularly in lower-resource languages, required a vast, clean dataset.
Experiments indicated that Gemma 3 had likely already seen the high-quality public data available for high-resource languages. Consequently, the team found that prioritizing novel, “unseen” data for languages like Indonesian yielded better results than retraining on “seen” data.
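One typical building block of such a curation pipeline is language-identification filtering. The sketch below illustrates the idea with the fastText library and its public lid.176.bin model; the target languages, thresholds, and length heuristic are illustrative assumptions, not AISG's actual pipeline.

```python
# Minimal sketch: keep documents that are confidently identified as a target
# SEA language and pass a simple length heuristic. Thresholds are illustrative.
import fasttext  # pip install fasttext; download lid.176.bin separately

LID_MODEL_PATH = "lid.176.bin"          # public fastText language-ID model
TARGET_LANGS = {"id", "ms", "th", "vi", "ta", "tl", "my", "km"}
MIN_CHARS, MIN_CONFIDENCE = 200, 0.80

lid = fasttext.load_model(LID_MODEL_PATH)

def keep_document(text: str) -> bool:
    """Return True if the document looks like clean text in a target language."""
    text = text.replace("\n", " ").strip()
    if len(text) < MIN_CHARS:
        return False
    labels, probs = lid.predict(text, k=1)
    lang = labels[0].replace("__label__", "")
    return lang in TARGET_LANGS and probs[0] >= MIN_CONFIDENCE

docs = ["..."]  # placeholder corpus
filtered = [d for d in docs if keep_document(d)]
```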
Post-training is where the model learns to be helpful, follow instructions, and respond safely. High-quality instruction-following data in many SEA languages is scarce, making synthetic data generation a cornerstone of model alignment.
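As a rough illustration of the synthetic-data idea, the sketch below asks a teacher model on Vertex AI to draft instruction-response pairs in a target language, using the Google Gen AI SDK (google-genai). The prompt, teacher model name, and project settings are placeholders; a production pipeline like AISG's would add quality filtering, deduplication, and human review.

```python
# Minimal sketch: generate candidate instruction-response pairs in a target
# SEA language with a teacher model on Vertex AI. All names are placeholders.
from google import genai

client = genai.Client(vertexai=True, project="your-project", location="us-central1")

PROMPT = (
    "Write one helpful instruction a user might ask in {language}, about an "
    "everyday local topic, followed by a high-quality answer in {language}.\n"
    "Format:\nINSTRUCTION: ...\nRESPONSE: ..."
)

def generate_pair(language: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # placeholder teacher model
        contents=PROMPT.format(language=language),
    )
    return response.text

for lang in ["Indonesian", "Thai", "Tamil"]:
    print(generate_pair(lang))
```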
GPU compute is a scarce and expensive resource, and under strict timelines the quality of training comes down to how effectively a team can navigate the complex experimental landscape. The team focused on maximizing insights from every run and prioritizing rapid, small-scale experiments with smaller models like Gemma 3 1B and 4B before scaling successful approaches to the larger, more resource-intensive 27B model.
Vertex AI provided a customized and optimized version of NVIDIA’s NeMo Framework, which delivered a significant 13-23% throughput improvement over Composer. Working with Vertex AI engineers, the team further optimized the NeMo container for Gemma 3 and improved training step latency by 1.96x.
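For context, gains like these are typically quantified by converting step time into throughput (tokens per second) at a fixed global batch size and sequence length. The numbers in the sketch below are made up purely to show the arithmetic; they are not the team's measurements.

```python
# Illustrative arithmetic only: how throughput (tokens/sec) and speedup are
# derived from step time, global batch size, and sequence length.
def tokens_per_second(global_batch_size: int, seq_len: int, step_time_s: float) -> float:
    return global_batch_size * seq_len / step_time_s

baseline = tokens_per_second(global_batch_size=512, seq_len=8192, step_time_s=14.0)
optimized = tokens_per_second(global_batch_size=512, seq_len=8192, step_time_s=7.1)

print(f"baseline : {baseline:,.0f} tokens/s")
print(f"optimized: {optimized:,.0f} tokens/s")
print(f"step-latency speedup: {14.0 / 7.1:.2f}x")  # made-up numbers for illustration
```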
The primary challenge during continued pre-training was to teach the model new linguistic and cultural nuances without losing pre-existing capabilities. This required a meticulous data mixing strategy, carefully balancing new data with the model’s original training corpus. Methods like model merging also offered a powerful way to combine a specialized model’s knowledge with the base model’s general capabilities.
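As an illustration of the model-merging idea mentioned above, the sketch below linearly interpolates the weights of a base checkpoint and a specialized checkpoint that share the same architecture (sometimes called weight averaging). The file paths and mixing coefficient are placeholders, and this is not AISG's exact merging recipe.

```python
# Minimal sketch: linear interpolation ("weight averaging") of two checkpoints
# that share the same architecture. alpha controls how much of the specialized
# (continued-pre-trained) weights to keep. Paths and alpha are placeholders.
import torch

ALPHA = 0.6  # weight on the specialized model

base_sd = torch.load("base_model.pt", map_location="cpu")
spec_sd = torch.load("specialized_model.pt", map_location="cpu")

merged_sd = {
    name: ALPHA * spec_sd[name] + (1.0 - ALPHA) * base_sd[name]
    for name in base_sd
}

torch.save(merged_sd, "merged_model.pt")
```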
Once the model has sufficient knowledge, post-training aligns it into a more helpful assistant whose responses are better grounded in local contexts. The post-training phase began with supervised fine-tuning on instruction-response pairs, then progressed to alignment algorithms like Direct Preference Optimization (DPO) and Reinforcement Learning (RL) to bring the model’s outputs closer to human preferences.
Executing these steps effectively is where purpose-built frameworks like NVIDIA’s NeMo-RL are essential, providing optimized, scalable implementations of state-of-the-art alignment techniques.
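To make the DPO step concrete, the sketch below shows the standard DPO objective computed from sequence log-probabilities under the policy being trained and a frozen reference model. This is an illustrative PyTorch formulation, not the NeMo-RL implementation; beta and the toy inputs are placeholders.

```python
# Minimal sketch of the standard DPO objective. Inputs are summed token
# log-probabilities of chosen/rejected responses under the policy being
# trained and under a frozen reference model. beta is a placeholder value.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Implicit reward margins relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that chosen beats rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```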
A cohesive evaluation strategy is paramount for guiding model development and must be adapted to the specific nuances of the target use case. SEA-LION v4 was evaluated on four pillars.
During training, the team tracked where a run was heading by monitoring the loss curve, identifying inflection points, and checking whether metrics aligned with expectations.
Additional metrics like perplexity-based and log probability-based evaluations provided deeper insights into the model’s fundamental language and sequence understanding, enabling informed decisions earlier in the training process.
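As a rough sketch of how a perplexity check can be run with off-the-shelf tooling, the example below computes perplexity from the mean cross-entropy loss of a causal language model on a single text, using Hugging Face transformers. The small google/gemma-3-1b-it checkpoint (gated access assumed) and the sample sentence are stand-ins, not the team's actual evaluation setup.

```python
# Minimal sketch: perplexity of a model on a single text, computed from the
# mean cross-entropy over next-token predictions. Model and text are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-3-1b-it"  # small checkpoint for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

text = "Cuaca hari ini sangat cerah dan cocok untuk berjalan-jalan."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # When labels are provided, the model returns the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```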
For English, the team evaluated the model on BBH, GPQA, IFEval, Math-Hard, MMLU-Pro, and MUSR tasks to ensure the model’s core reasoning, understanding, and instruction-following capabilities remained world-class.
For SEA languages, they used the SEA-HELM benchmark suite, which tests performance in Burmese, Filipino, Indonesian, Malay, Tamil, Thai, and Vietnamese across a diverse range of tasks, such as natural language processing, chat, and instruction-following.
To address the signal gap in pre-training data for low-resource languages, novel evaluation sets for Burmese and Khmer were generated using Vertex AI’s synthetic data capabilities. This approach ensured these languages had reliable metrics for measuring progress.
The team worked with native speakers to understand the model’s performance with the depth and detail that only a native speaker could provide. Standardizing these metrics allowed them to evaluate the model with fine-grained understanding in low-resource language settings.
A state-of-the-art model is not a static artifact but a living project. The team’s development philosophy treats every training run as a formal experiment, meticulously tracking metrics to compare training methods and validate hypotheses. Each successful experiment that yields a superior model culminates in a new, clearly documented version.
This rigorous model versioning is crucial not only for reproducibility and traceability but also for managing a continuous cycle of retraining that ensures the model’s capabilities consistently evolve and improve over time.
SEA-LION v4 represents the next step toward democratizing state-of-the-art AI for Southeast Asia. By training Gemma 3 on Vertex AI, AI Singapore has released a model that is technically performant, culturally knowledgeable, and useful for all.
Thank you to the following for their contributions to this article and the model's development: Thomas Le Moullec, Irina Sigler, Mayank Sharan, Mohammadreza Mohseni, Chris Elliott, Chloe Huang, Bingyuan Liu, Tian Shi, Fanny Wei, Jiuqiang Tang, Xiang Xu, Minwoo Park, Ting Yu, Michelle Loh, Saurabh Mangal, Pratyusha Mukherjee, Robert Van Dusen, Stephanie Sim.