AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

Abstract

Large foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet-scale data to learn knowledge and reasoning for everyday tasks. In this paper we show how existing foundation models can be used to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. We refer to this system as AutoRT, which runs on over 50 robots across multiple buildings, collecting 77k real robot episodes via both teleoperation and autonomous robot policies. Such “in-the-wild” data is significantly more diverse than previous robotic datasets collected in robot lab settings, and improves robotics policies when used in co-fine-tuning. When combined with vision, large language models can propose novel instructions based on their environment and reason about autonomy tradeoffs and safety without finetuning, allowing orchestration of large-scale robot fleets. We further show how introducing LLMs and VLMs into data collection creates new forms of interaction for steering robot agents to collect more diverse data or data for specific settings.
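To make the orchestration idea concrete, below is a minimal, hypothetical sketch of the kind of collection loop the abstract describes: a VLM summarizes each robot's scene, an LLM proposes candidate instructions, the candidates are filtered by safety and feasibility guardrails, and the surviving task is dispatched to the robot. All class and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an AutoRT-style orchestration round (assumed names, not the paper's code).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    robot_id: str
    instruction: str
    success: bool


@dataclass
class Robot:
    robot_id: str
    capture_image: Callable[[], bytes]   # returns the current camera frame
    execute: Callable[[str], Episode]    # attempts an instruction, returns the logged episode


def collect_round(
    robots: List[Robot],
    describe_scene: Callable[[bytes], str],     # VLM: image -> scene description
    propose_tasks: Callable[[str], List[str]],  # LLM: scene description -> candidate instructions
    passes_guardrails: Callable[[str], bool],   # LLM critic / rules: reject unsafe or infeasible tasks
) -> List[Episode]:
    """One fleet-wide round: propose, filter, and execute at most one task per robot."""
    episodes: List[Episode] = []
    for robot in robots:
        scene = describe_scene(robot.capture_image())
        candidates = [t for t in propose_tasks(scene) if passes_guardrails(t)]
        if not candidates:
            continue  # nothing safe or feasible to attempt in this scene
        episodes.append(robot.execute(candidates[0]))
    return episodes
```

In this sketch the guardrail check stands in for the paper's autonomy and safety reasoning; swapping `passes_guardrails` or `propose_tasks` is how a user could steer the fleet toward more diverse data or data for a specific setting.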

Authors

Alex Irpan, Keerthana Gopalakrishnan, Sergey Levine, Ted Xiao, Peng Xu, Ryan Julian, Sean Kirmani, Debidatta Dwibedi, Karol Hausman, Dorsa Sadigh, Brian Ichter, Yao Lu, Stefan Welker, Pannag Sanketi, Kanishka Rao, Edward Lee, Fei Xia, Isabel Leal, Pierre Sermanet, Nikhil Joshi, Zhuo Xu, Quan Vuong, Michael Ahn, Chelsea Finn, Montse Gonzalez Arenas, Steve Xu, Sharath Maddineni

Venue

arXiv