Abstract
Large language models (LLMs) are computationally expensive to deploy. Parameter sharing offers a promising path towards reducing their size and cost, but its effectiveness in modern LLMs remains limited. In this work, we revisit parameter sharing for LLMs and introduce a novel method for converting existing LLMs into smaller, ``Recursive Transformers'' that share parameters across layers, with minimal loss of performance. The Recursive Transformer is efficiently initialized from a standard, pre-trained Transformer, but uses only a single block of unique layers that is applied multiple times in a loop. We further show that the performance of the Recursive Transformer can be substantially improved, with little additional overhead, by incorporating loop-specific, low-rank adapters that allow the looped layers to specialize while maintaining a compact representation. Through careful initialization and minimal additional uptraining, we observe that recursive models converted from pretrained models twice their size can outperform vanilla pretrained models of comparable size. Notably, when combined with knowledge distillation, our relaxed models achieve performance comparable to the original unshared model pretrained on a significantly larger corpus. Finally, we introduce Continuous Depth-wise Batching, a new inference paradigm enabled by the Recursive Transformer when paired with early-exiting-based dynamic computation schemes. Theoretically, we show that this enables dynamic scheduling of samples at different depths, yielding significant throughput gains (up to a 2.76$\times$ speedup).
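The core architectural idea, a single shared block applied repeatedly with a small loop-specific low-rank adapter added at each pass, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the class names (RecursiveBlock, LoopLoRA) are hypothetical, a PyTorch encoder layer stands in for a decoder block, and for brevity the adapter is applied as a low-rank residual on the block output rather than to individual weight matrices.

```python
# Minimal sketch (assumed, not the paper's released code) of a looped shared
# block with per-loop low-rank adapters.
import torch
import torch.nn as nn


class LoopLoRA(nn.Module):
    """Low-rank residual update; one instance per loop iteration."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: adapter contributes nothing at first

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class RecursiveBlock(nn.Module):
    """A single block of unique layers, reused num_loops times with loop-specific adapters."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_loops: int = 2, rank: int = 8):
        super().__init__()
        # Stand-in for a decoder block; weights are shared across all loops.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # One low-rank adapter per loop iteration (the "relaxed" part).
        self.adapters = nn.ModuleList(LoopLoRA(d_model, rank) for _ in range(num_loops))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for adapter in self.adapters:               # loop over tied applications
            x = self.shared_layer(x) + adapter(x)   # shared weights + per-loop low-rank residual
        return x


if __name__ == "__main__":
    model = RecursiveBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, sequence, hidden)
    print(model(tokens).shape)         # torch.Size([2, 16, 256])
```

With the adapters zero-initialized, the model starts out equivalent to plain weight tying across loops, and the low-rank parameters are then free to learn loop-specific specialization during uptraining.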
Authors
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
Venue
ICLR 2025