Abstract
Recent work transfers large-scale image-to-text models to the video domain via shallow late temporal fusion while keeping the image encoder frozen. In contrast, we train video-first encoders, plug them into a frozen LM, and show that joint space-time attention yields improvements on benchmarks with strong temporal dependencies (e.g., YouCook2, VATEX). However, we expose two limitations of this approach: (1) decreased spatial capabilities, likely due to noisy video-language alignments, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze ways to improve memory efficiency when training video-first encoders, including input sampling, parameter-efficient image-to-video adaptation, and factorized attention. Surprisingly, simply masking large portions of the video (up to 75%) proves to be the most robust way to scale encoders to videos of up to 4.3 minutes at 1 fps. Our simple approach for training long video-to-text models, which use fewer than 1B parameters, outperforms the popular paradigm of using a strong LLM as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
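To illustrate the masking idea described above, here is a minimal sketch (not the authors' code) of randomly dropping a large fraction of video patch tokens before the encoder, so that joint space-time attention runs only over the kept tokens. The helper name `mask_video_tokens`, the shapes, and the 0.75 default ratio are illustrative assumptions, not the paper's implementation.

```python
# Sketch: random token masking for memory-efficient video encoding.
# Assumptions: flattened space-time patch embeddings of shape
# (batch, num_tokens, dim); mask ratio and helper name are hypothetical.
import numpy as np

def mask_video_tokens(tokens, mask_ratio=0.75, rng=None):
    """Return a random subset of tokens (and their indices) per clip."""
    rng = rng or np.random.default_rng(0)
    b, n, d = tokens.shape
    num_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    kept, kept_idx = [], []
    for i in range(b):
        idx = rng.permutation(n)[:num_keep]  # random subset for this clip
        kept.append(tokens[i, idx])
        kept_idx.append(idx)
    return np.stack(kept), np.stack(kept_idx)

# Example: 64 frames x 196 patches = 12544 tokens; keeping 25% leaves 3136,
# so the encoder attends over a much shorter sequence.
video_tokens = np.random.randn(2, 64 * 196, 768).astype(np.float32)
kept, kept_idx = mask_video_tokens(video_tokens, mask_ratio=0.75)
print(kept.shape)  # (2, 3136, 768)
```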
Authors
Nelly Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Viorica Patraucean, Joe Heyward, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzadeh
Venue
CVPR 2024