The Institute of Science Tokyo creates powerful Japanese-focused LLM with Gemma 2
The Institute of Science Tokyo, together with the National Institute of Advanced Industrial Science and Technology (AIST), is working to create LLMs that excel in Japanese. Their research consists of identifying popular LLMs that demonstrate excellent capabilities in language understanding, generation, and dialogue, and continually pre-training them on large bodies of Japanese text to create their Swallow line of LLMs.
Science Tokyo and AIST’s latest research efforts resulted in the creation of Gemma-2-Llama Swallow, a new LLM that delivers unparalleled Japanese language knowledge and performance, thanks in part to Gemma’s strong base-level proficiency in the language.
The challenge
The institute recognized that many of the world’s most popular LLMs focus on Western languages like English and lack reliable utility in many other languages, from European to Southeast Asian languages, and, in this case, Japanese. Meanwhile, the cost of the models that did offer Japanese functionality outweighed the performance they delivered.
The Swallow developer team began creating Japanese-focused iterations of popular models such as Llama, Mistral, and Mixtral, with varying degrees of success. Eventually, the team chose Gemma 2 for its stronger base-level Japanese capabilities. “Gemma 2 already exhibited strong instruction-following and dialogue capabilities in Japanese,” said Naoaki Okazaki, professor at the Institute of Science Tokyo. But the team knew they could make the model even better.
Chart representing Gemma-2-Llama Swallow 27B IT v0.1’s superior performance.
The solution
To improve Gemma 2’s Japanese proficiency, the team continually pre-trained the model on a massive amount of Japanese training data developed specifically by the Swallow team. Gemma 2’s existing coverage of Japanese in its tokenizer vocabulary simplified this process, allowing the team to skip modifying the tokenizer or token embeddings.
Compared to other overseas LLMs, Gemma 2's tokenizer vocabulary includes a larger number of Japanese characters and words. This eliminated the need to modify the tokenizer or token embeddings before continual pre-training. Furthermore, its less restrictive licensing allowed us to leverage it for tasks such as filtering Japanese pre-training data and synthesizing instruction-tuning data.
Professor Naoaki Okazaki, Institute of Science Tokyo
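The vocabulary-coverage point in the quote above is easy to check in practice. The following is a minimal sketch, not the team’s actual tooling, that counts how many tokens different tokenizers need for the same Japanese sentence; the model IDs and sample sentence are illustrative, and the gated repositories require accepting their licenses on Hugging Face.

```python
# Minimal sketch: compare how many tokens different tokenizers need for the
# same Japanese sentence. Fewer tokens indicates better vocabulary coverage,
# which is why the Gemma 2 tokenizer and token embeddings could be left
# unmodified before continual pre-training. Model IDs and the sample sentence
# are illustrative, not the Swallow team's actual setup.
from transformers import AutoTokenizer

SAMPLE_JA = "東京科学大学は日本語に強い大規模言語モデルを開発しています。"

for model_id in ["google/gemma-2-9b", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokens = tokenizer.tokenize(SAMPLE_JA)
    print(f"{model_id}: {len(tokens)} tokens for the sample sentence")
```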
The team’s efforts resulted in the creation of Gemma-2-Llama Swallow in 2B, 9B, and 27B parameter versions. “Since Gemma 2 already exhibited strong instruction-following and dialogue capabilities in Japanese, we were able to employ imitation learning for the instruction tuning of our model from Gemma 2 27B,” said Okazaki. The size and performance of Gemma 2 27B also helped the team save valuable resources.
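To make the idea of imitation learning concrete, here is a simplified sketch of the general recipe: an instruction-tuned teacher model (here Gemma 2 27B IT) answers Japanese prompts, and the resulting prompt-response pairs are saved as synthetic instruction-tuning data for the student. The seed prompts, generation settings, and output file are hypothetical and do not represent the Swallow team’s actual pipeline.

```python
# Simplified sketch of imitation learning for instruction tuning: a teacher
# model generates responses to Japanese prompts, and the pairs are stored as
# synthetic training data. Prompts, settings, and file names are illustrative.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "google/gemma-2-27b-it"  # requires accepting the Gemma license

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
model = AutoModelForCausalLM.from_pretrained(
    TEACHER, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical seed prompts; a real pipeline would draw from a large prompt pool.
prompts = [
    "日本の四季についてわかりやすく説明してください。",
    "機械学習と深層学習の違いを教えてください。",
]

records = []
for prompt in prompts:
    chat = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Keep only the newly generated tokens as the teacher's response.
    response = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    records.append({"instruction": prompt, "output": response})

with open("synthetic_instructions.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```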
The impact
To put Gemma-2-Llama Swallow to the test, the team evaluated the models on 10 Japanese understanding and generation tasks, 10 English understanding and generation tasks, and the Japanese MT-Bench.
Because Gemma 2 27B demonstrates performance comparable to other open LLMs in the 70B class, we were able to construct synthetic data using fewer computational resources.
Professor Naoaki Okazaki, Institute of Science Tokyo
The team found that, at the time of release, Gemma-2-Llama Swallow demonstrated the highest performance among LLMs of comparable size on Japanese language understanding and generation tasks. The 2B and 9B variants stood out the most, performing on par with LLMs one size class larger while using fewer resources.
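Readers who want to try the Japanese generation quality described above can load one of the released checkpoints directly. The sketch below assumes the models are published under the tokyotech-llm organization on Hugging Face, as earlier Swallow releases are; check the Swallow project page for the exact repository names.

```python
# Minimal sketch of prompting a released Gemma-2-Llama Swallow checkpoint for
# Japanese generation. The repository ID is an assumption based on where
# earlier Swallow models are hosted; verify the exact name before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "tokyotech-llm/Gemma-2-Llama-Swallow-9b-it-v0.1"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "日本語で自己紹介してください。"}]
inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```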
What’s next
The Institute of Science Tokyo will continue to refine Gemma-2-Llama Swallow following its initial launch in May 2025. The team expects the LLM will encourage more research on, and adoption of, Japanese-proficient models. “Despite being relatively smaller LLMs, they could be applied to a variety of applications,” said Okazaki, highlighting the models’ nimbleness alongside performance that matches 70B-class LLMs. The team is also working on improving the Japanese capability of Gemma 3 to create Swallow models that are even faster, more powerful, and more cost-efficient.
For the researchers at the Institute of Science Tokyo, these models represent another step toward human-like intelligence in computers. “Realizing artificial intelligence has been a dream since the dawn of computing,” said Okazaki. “Large language models are bringing us closer to that reality. We are entering an exciting era where, as AI developers, we can witness computers becoming increasingly intelligent.”