Abstract
Despite recent advances in the multilingual capabilities of Large Language Models (LLMs), LLMs remain limited to textual scripts, hampering knowledge transfer across languages with different writing systems and introducing potential biases from pre-training on specific scripts. As languages are naturally perceived in both writing and speech, phonemic transcriptions could provide essential signals for enhancing multilingual learning in text-based LLMs. In this work, we first conduct a pilot study of the performance discrepancy between languages with different writing scripts across state-of-the-art LLM families, demonstrating the benefits of integrating phonemic signals to enhance overall language representations and facilitate multilingual knowledge transfer. We then explore integrating phonemic signals into existing LLMs via enhanced in-context learning (ICL) retrieval to improve performance on various downstream NLP tasks at inference time.
Authors
Hoang H. Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
Venue
NAACL 2025