Introducing D4RT, a unified AI model for 4D scene reconstruction and tracking across space and time.
Whenever we look at the world, we perform an extraordinary feat of memory and prediction. We see and understand things as they are at a given moment, as they were a moment ago, and as they will be in the moment to follow. Our mental model of the world maintains a persistent representation of reality, and we use that model to draw intuitive conclusions about the causal relationships between past, present, and future.
To help machines see the world more like we do, we can equip them with cameras, but that only solves the problem of input. To make sense of that input, computers must solve a complex inverse problem: taking a video, which is a sequence of flat 2D projections, and recovering the rich, volumetric 3D world in motion that produced it.
Today, we are introducing D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that unifies dynamic scene reconstruction into a single, efficient framework, bringing us closer to the next frontier of artificial intelligence: total perception of our dynamic reality.
To understand a dynamic scene captured in 2D video, an AI model must track every pixel of every object as it moves through the three dimensions of space and the fourth dimension of time. It must also disentangle this motion from the motion of the camera, maintaining a coherent representation even when objects move behind one another or leave the frame entirely. Traditionally, capturing this level of geometry and motion from 2D video has required computationally intensive processing or a patchwork of specialized AI models (some for depth, others for motion or camera pose), resulting in reconstructions that are slow and fragmented.
D4RT’s simplified architecture and novel query mechanism place it at the forefront of 4D reconstruction while being up to 300x more efficient than previous methods — fast enough for real-time applications in robotics, augmented reality, and more.
D4RT uses a unified encoder-decoder Transformer architecture. The encoder first processes the input video into a compressed representation of the scene’s geometry and motion. Unlike older systems that employ separate modules for different tasks, D4RT computes only what it needs via a flexible querying mechanism centered on a single, fundamental question:
"Where is a given pixel from the video located in 3D space at an arbitrary time, as viewed from a chosen camera?"
Building on our prior work, we pair this encoder with a lightweight decoder that queries the representation to answer specific instances of the posed question. Because queries are independent, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether it’s tracking just a few points or reconstructing an entire scene.
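To illustrate why this independence matters, the sketch below batches thousands of query vectors through a stand-in decoder with jax.vmap. The decode_one function is a toy we invented to mimic the shape of the computation; D4RT’s real decoder is a Transformer, and none of these names come from the report.

```python
import jax
import jax.numpy as jnp

def decode_one(scene_encoding, query_vector):
    """Toy stand-in for the decoder: (scene, one query) -> one 3D point."""
    # scene_encoding: (tokens, dim) compressed representation of the video
    # query_vector:   (q_dim,) encodes pixel, source/target time, and camera
    attn = jax.nn.softmax(scene_encoding[:, :query_vector.shape[0]] @ query_vector)
    return attn @ scene_encoding[:, :3]  # (3,) predicted xyz

# Queries are independent, so vmap turns thousands of them into a single
# parallel call on the accelerator.
decode_many = jax.vmap(decode_one, in_axes=(None, 0))

scene = jnp.zeros((256, 128))         # stand-in for the encoder's output
queries = jnp.zeros((10_000, 16))     # ten thousand query vectors
points = decode_many(scene, queries)  # shape (10000, 3)
```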
D4RT combines a powerful encoder that builds a rich, global understanding of the video, and a lightweight decoder that answers thousands of queries in parallel. By asking specific questions — identifying where a source pixel is located at a target time and camera view — the model efficiently solves diverse tasks like tracking, depth estimation, and pose estimation through a single, flexible interface.
With this flexible formulation, a single model can now solve a wide variety of 4D tasks, including point tracking, depth estimation, camera pose estimation, and full scene reconstruction.
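Under this interface, each task reduces to a different pattern of queries. The helpers below sketch two such patterns, reusing the hypothetical Query record from earlier; they illustrate the idea rather than the model’s API.

```python
# Assumes the illustrative Query dataclass sketched above.

def track_point(u, v, t0, num_frames, cam=0):
    """Point tracking: ask where one source pixel is at every time step."""
    return [Query(u, v, t_source=t0, t_target=t, camera_id=cam)
            for t in range(num_frames)]

def depth_queries(width, height, t, cam=0):
    """Depth estimation: query every pixel at its own frame and camera;
    the z-component of each answered 3D point is that pixel's depth."""
    return [Query(u, v, t_source=t, t_target=t, camera_id=cam)
            for v in range(height) for u in range(width)]

# (Assumption: camera pose estimation could likewise reuse the same
# answers, since points expressed in different cameras constrain each
# camera's position and orientation.)
```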
As detailed in the underlying technical report, D4RT outperforms previous methods across a wide spectrum of 4D reconstruction tasks. Qualitative comparisons show that while other methods struggle with dynamic objects — often duplicating them or failing to reconstruct them entirely — D4RT maintains a solid, continuous understanding of the moving world.
Crucially, D4RT’s precision does not come at the expense of efficiency. In testing, it performed 18x to 300x faster than the previous state of the art. For example, D4RT processed a one-minute video in roughly five seconds on a single TPU chip, a task that took previous state-of-the-art methods up to ten minutes: a 120x speedup.
D4RT demonstrates that we don't need to choose between accuracy and efficiency in 4D reconstruction. Its flexible, query-based system can capture our dynamic world in real time, paving the way for the next generation of spatial computing.
We're continuing to explore the model’s capabilities and potential for applications across robotics, augmented reality, and beyond.