AI Singapore

SEA-LION v4 is an open multimodal model trained for Southeast Asian languages and cultural contexts

With the help of Google Cloud’s Vertex Training Clusters (VTC) infrastructure and model training science support, AI Singapore developed a state-of-the-art SEA language model based on Gemma 3 27B.

In late 2025, AI Singapore (AISG) released SEA-LION v4, the next chapter for their flagship series of language models built for the languages and cultures of Southeast Asia (SEA).

For this new version, the team chose Gemma 3 27B as its foundation. This model architecture provided a powerful starting point with multimodal processing, a long context window, broad multilingual coverage, and native function calling.

Adapting a 27-billion-parameter model to deeply understand cultural context while retaining these advanced capabilities—and doing so on a tight schedule—required a specialized training environment. The AISG team turned to Google Cloud’s Vertex Training Clusters (VTC) to simplify and accelerate large-scale model development and customization. By integrating training optimizations into the setup, the team boosted throughput by up to 23% and training step speed by almost 2x.

Building SEA-LION v4


SEA-LION models, including the previous version based on Gemma 2, are designed for real-world use. The latest release, based on Gemma 3, is the #1 model for Tamil and Filipino and ranks #4 out of 55 models on the SEA-HELM leaderboard, outperforming much larger systems while running on a laptop with 32 GB of RAM.

To ensure timely and cost-effective development, the team needed to optimize the entire training pipeline for efficient resource utilization. This article walks through that process:

  1. Base model and tokenizer selection: Assessing Gemma 3 capabilities
  2. Data preparation and curation: Curating data for pre-training and post-training
  3. Experimentation: Maximizing training efficiency for experimentation
  4. Training: Strategy and techniques used in model training
  5. Evaluation: Building a comprehensive evaluation suite
  6. Lifecycle: Managing model operations


VTC provided the AI Singapore team with the necessary data science tooling, optimized training recipes, and a resilient Slurm-based infrastructure for distributed training.

Base model and tokenizer selection

Creating a specialized language model begins with two foundational choices: the base model and its tokenizer. These decisions have a cascading effect on the entire development lifecycle, from data preparation to training efficiency and final performance.

AISG needed a model with a strong, pre-existing foundation in both multilingual understanding and multimodality, leading them to choose the Gemma 3 27B IT model. The Gemma 3 family stood out in many benchmarks and offered native text-and-image processing out of the box. The 27-billion-parameter size offered the best of both worlds: It was large enough to capture complex patterns in data but remained within a manageable scale for efficient training and deployment.

For language use cases, a critical early decision is whether the base model’s tokenizer is sufficient or whether it needs to be modified. Two main paths can be explored to improve it. The first is extending the tokenizer, which involves adding new tokens to the existing vocabulary; this is a good option when the base tokenizer is mostly effective but lacks specific characters or common words in the target languages. The second is training a new tokenizer, a more complex and resource-intensive approach that requires a complete pre-training run but can yield long-term benefits in performance and efficiency.

In its assessments, the team found that the Gemma 3 tokenizer required no modifications: it already provided robust performance on SEA language datasets.
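
To illustrate this kind of assessment, the minimal sketch below measures tokenizer fertility (average tokens per whitespace-separated word) on a few sample sentences. The model ID and sample texts are illustrative placeholders, not the team’s actual evaluation data.

```python
# Minimal sketch: measure tokenizer fertility (tokens per whitespace-separated
# word) for a few SEA languages. Model ID and sample sentences are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")

samples = {
    "Indonesian": "Cuaca hari ini sangat cerah dan menyenangkan.",
    "Filipino": "Magandang umaga po, kumusta kayo ngayong araw?",
    "Vietnamese": "Hôm nay trời đẹp quá, chúng ta đi dạo nhé.",
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    fertility = len(tokens) / max(len(words), 1)
    print(f"{lang}: {len(tokens)} tokens for {len(words)} words "
          f"(fertility ~ {fertility:.2f})")
```

Fertility values that stay close to those observed for high-resource languages are one signal that the existing vocabulary is adequate and no extension is needed.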

Data preparation and curation

The team’s approach to data selection was split between curating a broad knowledge base for continued pre-training and crafting precise, skill-oriented data for post-training.


Selecting data for continued pre-training (CPT)

Enriching the model’s core knowledge, particularly in lower-resource languages, required a vast, clean dataset.

  • Data cleaning: Deduplication and sophisticated filtering using a combination of heuristics and an “LLM-as-a-judge” approach to remove low-quality content and toxic language, validated by initial small-batch reviews with native speakers
  • Data tagging: Quality tagging system to classify data tiers (e.g., ‘high’, ‘normal’, ‘low’) and enable quality-aware data mixing, where higher-quality data (like research papers) is strategically weighted over lower-quality sources (like casual forum posts); a minimal sketch of this mixing approach follows the list. This boosted performance in Indonesian and Tagalog without causing regressions elsewhere
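
As a rough illustration of quality-aware mixing, the sketch below over-samples higher-tier documents when drawing training batches. The tier weights and record format are assumptions made for illustration, not AISG’s actual data mixture.

```python
# Sketch of quality-aware data mixing: documents tagged with a quality tier
# are sampled with tier-dependent weights. Tier weights and the record format
# are illustrative assumptions.
import random

TIER_WEIGHTS = {"high": 3.0, "normal": 1.0, "low": 0.3}  # assumed values

corpus = [
    {"text": "Peer-reviewed abstract ...", "lang": "id", "tier": "high"},
    {"text": "News article ...",           "lang": "tl", "tier": "normal"},
    {"text": "Casual forum post ...",      "lang": "id", "tier": "low"},
]

def sample_batch(corpus, batch_size, seed=0):
    """Draw a training batch in which higher-quality tiers are over-sampled."""
    rng = random.Random(seed)
    weights = [TIER_WEIGHTS[doc["tier"]] for doc in corpus]
    return rng.choices(corpus, weights=weights, k=batch_size)

batch = sample_batch(corpus, batch_size=8)
print(sum(doc["tier"] == "high" for doc in batch), "high-tier docs in batch")
```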

Experiments confirmed that Gemma 3 had likely already seen high-quality public data for high-resource languages. Consequently, the team found that prioritizing novel, “unseen” data for languages like Indonesian yielded better results than retraining on “seen” data.


Crafting data for post-training

Post-training is where the model learns to be helpful, follow instructions, and respond safely. High-quality instruction-following data in many SEA languages is scarce, making synthetic data generation a cornerstone of model alignment.

  • From CPT data: A multi-step AI agent pipeline was built to transform CPT documents into high-quality multi-turn conversational data. The agent checks if a document could lead to a conversation, creates conversation outlines, drafts the dialogue, and reviews quality.
  • Preference data from supervised fine-tuned (SFT) conversations: SFT conversations were used to create “better/worse” preference pairs for preference tuning, identifying areas where the assistant’s response was strong and then rewriting it to be intentionally worse; a sketch of this pairing step follows the list
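
The sketch below shows one way such preference pairs could be assembled. The conversation schema and the degrade_response() helper (a stand-in for an LLM prompted to produce a deliberately weaker rewrite) are hypothetical, not the team’s pipeline.

```python
# Sketch: turning SFT conversations into preference pairs for preference
# tuning. degrade_response() is a hypothetical stand-in for an LLM prompt
# that rewrites a strong answer to be intentionally worse.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # original, high-quality SFT response
    rejected: str  # deliberately degraded rewrite

def degrade_response(response: str) -> str:
    # Placeholder: in practice an LLM would be asked to make the answer
    # vaguer, less grounded, or off-tone while staying on topic.
    return response.split(".")[0] + "."

def build_pairs(sft_conversations):
    pairs = []
    for conv in sft_conversations:
        for turn in conv["turns"]:
            if turn["role"] == "assistant" and turn.get("quality") == "strong":
                pairs.append(PreferencePair(
                    prompt=turn["prompt"],
                    chosen=turn["response"],
                    rejected=degrade_response(turn["response"]),
                ))
    return pairs
```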

Maximizing efficiency for experimentation

GPU compute is a scarce and expensive resource, and under strict timelines the quality of training comes down to how effectively a team can navigate the complex experimental landscape. The AISG team focused on maximizing insights from every run and prioritized rapid, small-scale experiments with smaller models like Gemma 3 1B and 4B before scaling successful approaches to the larger, more resource-intensive 27B model.

Vertex AI provided a customized and optimized version of NVIDIA’s NeMo framework, which delivered a significant 13-23% throughput improvement over Composer. Working with Vertex AI engineers, the team further optimized the NeMo container for Gemma 3 and improved training step speed by 1.96x.

Training

The primary challenge during continued pre-training was to teach the model new linguistic and cultural nuances without losing pre-existing capabilities. This required a meticulous data mixing strategy, carefully balancing new data with the model’s original training corpus. Methods like model merging also offered a powerful way to combine a specialized model’s knowledge with the base model’s general capabilities.
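
The article does not specify the exact merging method used; as one simple illustration, the sketch below linearly interpolates a specialized checkpoint’s weights with the base model’s weights. The checkpoint paths and merge ratio are placeholders.

```python
# Sketch of one simple model-merging strategy: linear interpolation of the
# base model's weights with a specialized checkpoint. Paths and the merge
# ratio are illustrative placeholders.
import torch

def merge_state_dicts(base_sd, specialized_sd, alpha=0.5):
    """Return weights = (1 - alpha) * base + alpha * specialized."""
    merged = {}
    for name, base_param in base_sd.items():
        spec_param = specialized_sd[name]
        merged[name] = (1.0 - alpha) * base_param + alpha * spec_param
    return merged

base_sd = torch.load("base_model.pt", map_location="cpu")        # hypothetical path
spec_sd = torch.load("sea_specialized.pt", map_location="cpu")   # hypothetical path
merged_sd = merge_state_dicts(base_sd, spec_sd, alpha=0.5)
torch.save(merged_sd, "merged_model.pt")
```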

Once the model had sufficient knowledge, post-training aligned it to be a more helpful assistant whose responses are better grounded in local contexts. The post-training phase began with supervised fine-tuning on instruction-response pairs, then progressed to alignment techniques like Direct Preference Optimization (DPO) and Reinforcement Learning (RL) to bring the model’s outputs closer to human preferences.

Executing these steps effectively is where purpose-built frameworks like NVIDIA’s NeMo-RL are essential, providing optimized and scalable implementations of state-of-the-art alignment techniques.
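
As a framework-agnostic illustration of what preference tuning optimizes, the minimal sketch below implements the core DPO objective on summed per-response log-probabilities; it is a sketch of the published loss, not NeMo-RL’s implementation.

```python
# Sketch of the core DPO objective (Rafailov et al., 2023): given summed
# log-probs of the chosen and rejected responses under the policy and a
# frozen reference model, push the policy to prefer the chosen response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably
    return F.softplus(-(chosen_rewards - rejected_rewards)).mean()

# Toy usage with per-example summed log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-15.5]))
print(loss.item())
```

The frozen reference model keeps the policy from drifting too far from its starting point, while the margin term rewards ranking the chosen response above the rejected one.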

Building a comprehensive evaluation suite

A cohesive evaluation strategy is paramount for guiding model development and must be adapted to the specific nuances of the target use case. SEA-LION v4 was evaluated on four pillars.


1. Training metrics

During training, the team tracked where each run was heading by looking at the loss curve, identifying inflection points, and checking whether metrics aligned with expectations.

Additional metrics like perplexity-based and log probability-based evaluations provided deeper insights into the model’s fundamental language and sequence understanding, enabling informed decisions earlier in the training process.
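
For example, perplexity is simply the exponential of the mean per-token cross-entropy loss. The sketch below computes it for a single held-out sentence; the model ID and sample text are placeholders for illustration.

```python
# Sketch: perplexity = exp(mean per-token cross-entropy). Model ID and
# evaluation text are placeholders for held-out data in a target language.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # smaller variant used for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Selamat pagi, apa kabar hari ini?"  # sample Indonesian sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over tokens
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = math.exp(outputs.loss.item())
print(f"Perplexity: {perplexity:.2f}")
```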


2. Benchmarks

For English, the team evaluated the model on BBH, GPQA, IFEval, Math-Hard, MMLU-Pro, and MUSR tasks to ensure the model’s core reasoning, understanding, and instruction-following capabilities remained world-class.

For SEA languages, they used the SEA-HELM benchmark suite, which tests performance in Burmese, Filipino, Indonesian, Malay, Tamil, Thai, and Vietnamese across a diverse range of tasks, such as natural language processing, chat, and instruction-following.


3. Synthetic evaluation data

To address the signal gap in pre-training data for low-resource languages, novel evaluation sets for Burmese and Khmer were generated using Vertex AI’s synthetic data capabilities. This approach ensured these languages had reliable metrics for measuring progress.


4. Human-in-the-Loop validation

The team worked with native speakers to understand the model’s performance with the depth and detail that only a native speaker can provide. Standardizing this feedback into consistent metrics allowed them to evaluate the model with fine-grained understanding in low-resource language settings.

Model lifecycle & operations

A state-of-the-art model is not a static artifact but a living project. The team’s development philosophy treats every training run as a formal experiment, meticulously tracking metrics to compare different training methods and validate hypotheses. Each successful experiment that yields a superior model culminates in a new, clearly documented version.

This rigorous model versioning is crucial not only for reproducibility and traceability but also for managing a continuous cycle of retraining that ensures the model’s capabilities consistently evolve and improve over time.

SEA-LION v4 represents the next step toward democratizing state-of-the-art AI for Southeast Asia. By training Gemma 3 on Vertex AI, AI Singapore has released a model that is technically performant, culturally knowledgeable, and useful for all.


Thank you to the following for their contributions to this article and the model's development: Thomas Le Moullec, Irina Sigler, Mayank Sharan, Mohammadreza Mohseni, Chris Elliott, Chloe Huang, Bingyuan Liu, Tian Shi, Fanny Wei, Jiuqiang Tang, Xiang Xu, Minwoo Park, Ting Yu, Michelle Loh, Saurabh Mangal, Pratyusha Mukherjee, Robert Van Dusen, Stephanie Sim.

