Abstract
Visual understanding of our world goes beyond the semantics and flat structure of individual images. In this paper, we work towards capturing both the 3D structure and the dynamics of real-world scenes from monocular real-world videos. Our model, the Dynamic Scene Transformer (DyST), builds upon recent work in neural scene representation and learns a latent decomposition into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a special co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.
Authors
Max Seitzer*, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi
Venue
ICLR 2024