March 19, 2024

DiPaCo: Distributed Path Composition

Abstract

Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high-bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD-inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, our approach enables training across poorly connected and potentially heterogeneous workers. At test time, only a single path needs to be executed for each input, without the need for any model compression. We regard this approach as a first prototype of a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same number of training steps but less wall-clock time, DiPaCo using 256 paths of size 150M exceeds the performance of a 1B dense transformer language model.
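As a rough illustration of the idea in the abstract, the toy sketch below composes paths from shared modules and keeps the modules in sync with infrequent averaging. It is not the authors' implementation. The module layout, the scalar "parameters", the squared-error objective, and the names `MODULES_PER_LEVEL`, `local_steps`, and `sync_round` are all assumptions made for illustration. Plain averaging of parameter changes stands in for the Local-SGD-inspired outer update.

```python
from itertools import product

# Hypothetical layout: 2 levels holding 3 and 2 shared modules (illustrative sizes).
MODULES_PER_LEVEL = [3, 2]

# A path picks one module index per level, so different paths share modules.
paths = list(product(*(range(n) for n in MODULES_PER_LEVEL)))


def init_modules():
    # Each module is reduced to a single scalar parameter for illustration.
    return {(lvl, m): 0.0
            for lvl, n in enumerate(MODULES_PER_LEVEL)
            for m in range(n)}


def local_steps(params, path, target, steps=5, lr=0.1):
    """Run a few local gradient steps on a private copy of one path's modules."""
    keys = [(lvl, m) for lvl, m in enumerate(path)]
    local = {k: params[k] for k in keys}
    for _ in range(steps):
        pred = sum(local.values())      # toy forward pass: sum of module scalars
        grad = 2.0 * (pred - target)    # gradient of (pred - target)**2 per module
        for k in keys:
            local[k] -= lr * grad
    return local


def sync_round(params, targets):
    """One communication round: average each module's change over the paths using it."""
    deltas = {k: [] for k in params}
    for path, target in zip(paths, targets):
        local = local_steps(params, path, target)
        for k, value in local.items():
            deltas[k].append(value - params[k])
    return {k: params[k] + sum(d) / len(d) for k, d in deltas.items()}


params = sync_round(init_modules(), targets=[1.0] * len(paths))
```

In this sketch each path trains without talking to the others during `local_steps`, and communication happens only once per `sync_round`. At inference time a single path, meaning one module per level, would be evaluated for a given input.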

Authors

Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc’Aurelio Ranzato, Jiajun Shen, Arthur Szlam

Venue

arXiv

