Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities


Abstract

One of the main challenges of multimodal models is that they need to combine heterogeneous modalities (e.g. video, audio, text), which have different characteristics. For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are not necessarily synchronized with text, which is often present as global context, e.g. a title or description. Furthermore, video and audio inputs are of much larger volume and grow as video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder.

Here we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities and their sampling rates. More specifically, we propose a multimodal model consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for modalities which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences and autoregressively process their compact learned representations. To that end, we propose a Combiner mechanism, which models the audio-video information within a video snippet. The Combiner first learns to extract audio and visual features from raw spatio-temporal signals, and then learns to fuse these features into compact but expressive representations per snippet, which are then consumed by the subsequent autoregressive modeling in time.
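
As a rough illustration of this partition-and-combine idea, the following PyTorch-style sketch shows one way a snippet-level Combiner could fuse per-snippet audio and video features into a few latent tokens that are then processed sequentially. The module name `SnippetCombiner`, the helper `combine_snippets`, the snippet length, and the number of latent tokens are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the partition-and-Combiner idea described above.
# Module names, snippet length, and latent sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SnippetCombiner(nn.Module):
    """Fuses per-snippet audio and video features into a few latent tokens."""

    def __init__(self, dim=512, num_latents=8, num_heads=8, num_layers=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, D), audio_feats: (B, Ta, D) for ONE snippet.
        b = video_feats.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([latents, video_feats, audio_feats], dim=1)
        x = self.encoder(x)
        # Keep only the learned latent positions as the compact representation.
        return x[:, : self.latents.shape[0]]


def combine_snippets(combiner, video, audio, snippet_len=16):
    """Partition time-aligned video/audio features into snippets, combine each.

    video, audio: (B, T, D) features assumed already projected to a shared
    dimension and aligned in time. Returns (B, num_snippets * num_latents, D)
    compact tokens to be consumed autoregressively, snippet by snippet.
    """
    tokens = []
    for start in range(0, video.shape[1], snippet_len):
        v = video[:, start:start + snippet_len]
        a = audio[:, start:start + snippet_len]
        tokens.append(combiner(v, a))
    return torch.cat(tokens, dim=1)


if __name__ == "__main__":
    combiner = SnippetCombiner()
    video = torch.randn(2, 64, 512)   # 64 time steps of frame features
    audio = torch.randn(2, 64, 512)   # 64 time steps of audio features
    compact = combine_snippets(combiner, video, audio)
    print(compact.shape)  # torch.Size([2, 32, 512]) -> 4 snippets x 8 latents
```

In this sketch the sequence length handed to the autoregressive model scales with the number of snippets times the number of latent tokens rather than with the raw frame and audio token counts, which is the key to keeping compute bounded for long inputs.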

Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and producing powerful visual and audio representations that capture long-range dependencies and enable modeling of long-form video or video/audio inputs in time.

Authors

AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

Venue

CVPR 2024