DyST: Towards Dynamic Neural Scene Representations on Real-World Videos

Abstract

Visual understanding of our world goes beyond the semantics and flat structure of individual images. In this paper, we work towards capturing both the 3D structure and the dynamics of real-world scenes from monocular real-world videos. Our model, the Dynamic Scene Transformer (DyST), builds upon recent work in neural scene representation and learns a latent decomposition into scene content as well as per-view scene dynamics and camera pose. This separation is achieved through a special co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.

Authors

Max Seitzer*, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi

Venue

ICLR 2024