Introducing D4RT, a unified AI model for 4D scene reconstruction and tracking across space and time.
Whenever we look at the world, we perform an extraordinary feat of memory and prediction. We see and understand things as they are at a given moment, as they were a moment ago, and as they will be in the moment to follow. Our mental model of the world maintains a persistent representation of reality, and we use that model to draw intuitive conclusions about the causal relationships between past, present, and future.
To help machines see the world more like we do, we can equip them with cameras, but that only solves the problem of input. To make sense of that input, computers must solve a complex inverse problem: taking a video, which is a sequence of flat 2D projections, and recovering the rich, volumetric 3D world in motion that produced it.
Today, we are introducing D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that unifies dynamic scene reconstruction into a single, efficient framework, bringing us closer to the next frontier of artificial intelligence: total perception of our dynamic reality.
To understand a dynamic scene captured in 2D video, an AI model must track every pixel of every object as it moves through the three dimensions of space and the fourth dimension of time. It must also disentangle this motion from the motion of the camera, maintaining a coherent representation even when objects move behind one another or leave the frame entirely. Traditionally, capturing this level of geometry and motion from 2D video has required computationally intensive processing or a patchwork of specialized AI models (some for depth, others for motion or camera pose), resulting in reconstructions that are slow and fragmented.
D4RT’s simplified architecture and novel query mechanism place it at the forefront of 4D reconstruction while being up to 300x more efficient than previous methods — fast enough for real-time applications in robotics, augmented reality, and more.
D4RT uses a unified encoder-decoder Transformer architecture. The encoder first processes the input video into a compressed representation of the scene’s geometry and motion. Unlike older systems that employ separate modules for different tasks, D4RT computes only what it needs via a flexible querying mechanism centered on a single, fundamental question:
"Where is a given pixel from the video located in 3D space at an arbitrary time, as viewed from a chosen camera?"
Building on our prior work, we pair this encoder with a lightweight decoder that queries the representation to answer specific instances of the posed question. Because queries are independent, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether it’s tracking just a few points or reconstructing an entire scene.
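To illustrate why this independence matters, the sketch below batches thousands of query vectors through a stand-in decoder with jax.vmap. The decode_one function is a toy we invented to mimic the shape of the computation; D4RT’s real decoder is a Transformer, and none of these names come from the report.

```python
import jax
import jax.numpy as jnp

def decode_one(scene_encoding, query_vector):
    """Toy stand-in for the decoder: (scene, one query) -> one 3D point."""
    # scene_encoding: (tokens, dim) compressed representation of the video
    # query_vector:   (q_dim,) encodes pixel, source/target time, and camera
    attn = jax.nn.softmax(scene_encoding[:, :query_vector.shape[0]] @ query_vector)
    return attn @ scene_encoding[:, :3]  # (3,) predicted xyz

# Queries are independent, so vmap turns thousands of them into a single
# parallel call on the accelerator.
decode_many = jax.vmap(decode_one, in_axes=(None, 0))

scene = jnp.zeros((256, 128))         # stand-in for the encoder's output
queries = jnp.zeros((10_000, 16))     # ten thousand query vectors
points = decode_many(scene, queries)  # shape (10000, 3)
```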
D4RT combines a powerful encoder that builds a rich, global understanding of the video, and a lightweight decoder that answers thousands of queries in parallel. By asking specific questions — identifying where a source pixel is located at a target time and camera view — the model efficiently solves diverse tasks like tracking, depth estimation, and pose estimation through a single, flexible interface.
With this flexible formulation, a single model can now solve a wide variety of 4D tasks, including point tracking, depth estimation, camera pose estimation, and full scene reconstruction.
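Under this interface, each task reduces to a different pattern of queries. The helpers below sketch two such patterns, reusing the hypothetical Query record from earlier; they illustrate the idea rather than the model’s API.

```python
# Assumes the illustrative Query dataclass sketched above.

def track_point(u, v, t0, num_frames, cam=0):
    """Point tracking: ask where one source pixel is at every time step."""
    return [Query(u, v, t_source=t0, t_target=t, camera_id=cam)
            for t in range(num_frames)]

def depth_queries(width, height, t, cam=0):
    """Depth estimation: query every pixel at its own frame and camera;
    the z-component of each answered 3D point is that pixel's depth."""
    return [Query(u, v, t_source=t, t_target=t, camera_id=cam)
            for v in range(height) for u in range(width)]

# (Assumption: camera pose estimation could likewise reuse the same
# answers, since points expressed in different cameras constrain each
# camera's position and orientation.)
```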
As detailed in the underlying technical report, D4RT outperforms previous methods across a wide spectrum of 4D reconstruction tasks. Qualitative comparisons show that while other methods struggle with dynamic objects — often duplicating them or failing to reconstruct them entirely — D4RT maintains a solid, continuous understanding of the moving world.
Crucially, D4RT’s precision does not come at the expense of efficiency. In testing, it performed 18x to 300x faster than the previous state of the art. For example, D4RT processed a one-minute video in roughly five seconds on a single TPU chip, a task that took previous state-of-the-art methods up to ten minutes: a 120x speedup.
D4RT demonstrates that we don't need to choose between accuracy and efficiency in 4D reconstruction. Its flexible, query-based system can capture our dynamic world in real time, paving the way for the next generation of spatial computing.
We're continuing to explore the model’s capabilities and potential for applications across robotics, augmented reality, and beyond.