Abstract
Large language models (LLMs) are computationally expensive to deploy. Parameter sharing offers a promising path towards reducing their size and cost, but its effectiveness in modern LLMs remains limited. In this work, we revisit parameter sharing for LLMs and introduce a novel method for converting existing LLMs into smaller, ``Recursive Transformers'' that share parameters across layers, with minimal loss of performance. The Recursive Transformer is efficiently initialized from a standard, pre-trained Transformer, but uses only a single block of unique layers that is applied multiple times in a loop. We further show that the performance of the Recursive Transformer can be substantially improved, with little additional overhead, by incorporating loop-specific, low-rank adapters that allow the looped layers to specialize while maintaining a compact representation. Through careful initialization and minimal additional uptraining, we observe that recursive models converted from pretrained models twice their size can outperform vanilla pretrained models of comparable size. Notably, when combined with knowledge distillation, our relaxed models achieve performance comparable to the original unshared model pretrained on a significantly larger corpus. Finally, we introduce Continuous Depth-wise Batching, a new inference paradigm enabled by the Recursive Transformer when paired with early-exiting-based dynamic computation schemes. Theoretically, we show that this enables dynamic scheduling of samples at different depths, yielding significant throughput gains (up to a 2.76$\times$ speedup).
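The core architectural idea, a single shared block applied repeatedly with a small loop-specific low-rank adapter added at each pass, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the class names (RecursiveBlock, LoopLoRA) are hypothetical, a PyTorch encoder layer stands in for a decoder block, and for brevity the adapter is applied as a low-rank residual on the block output rather than to individual weight matrices.

```python
# Minimal sketch (assumed, not the paper's released code) of a looped shared
# block with per-loop low-rank adapters.
import torch
import torch.nn as nn


class LoopLoRA(nn.Module):
    """Low-rank residual update; one instance per loop iteration."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: adapter contributes nothing at first

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class RecursiveBlock(nn.Module):
    """A single block of unique layers, reused num_loops times with loop-specific adapters."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_loops: int = 2, rank: int = 8):
        super().__init__()
        # Stand-in for a decoder block; weights are shared across all loops.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # One low-rank adapter per loop iteration (the "relaxed" part).
        self.adapters = nn.ModuleList(LoopLoRA(d_model, rank) for _ in range(num_loops))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for adapter in self.adapters:               # loop over tied applications
            x = self.shared_layer(x) + adapter(x)   # shared weights + per-loop low-rank residual
        return x


if __name__ == "__main__":
    model = RecursiveBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, sequence, hidden)
    print(model(tokens).shape)         # torch.Size([2, 16, 256])
```

With the adapters zero-initialized, the model starts out equivalent to plain weight tying across loops, and the low-rank parameters are then free to learn loop-specific specialization during uptraining.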
Authors
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
Venue
ICLR 2025