SODA: Bottleneck Diffusion Models for Representation Learning


Abstract

We introduce SODA, a self-supervised diffusion model, explored for the purpose of representation learning. The model incorporates a conditional visual encoder, which distills an input image into a compact representation that, in turn, guides the generation of novel views of the input's content. We show that imposing a tight bottleneck between the visual encoder and the denoising decoder, in the form of a sole embedding through which they can communicate, turns diffusion models into strong and efficient representation learners, capable of capturing and predicting images' semantic properties in an unsupervised manner. We demonstrate this by attaining high performance on varied classification, reconstruction, and synthesis tasks over a wide array of datasets, ranging from CelebA to ShapeNet to ImageNet. Further investigation of the model's generative qualities reveals the disentangled nature of its emerging latent space, which serves as an effective interface to control and manipulate the produced outputs, so as to create diverse views and variations. All in all, we aim to shed light on the exciting and promising potential of diffusion models, not only for image generation, but also for learning rich and robust representations.
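To make the bottleneck idea concrete, here is a minimal sketch, not the authors' implementation: a small encoder compresses an image into a single embedding z, and a toy denoising decoder sees the input only through z (here via FiLM-style modulation), trained with a standard noise-prediction objective. The network sizes, the noise schedule, and all module names are illustrative assumptions.

```python
# Hedged sketch of a bottlenecked, conditioned diffusion setup (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckEncoder(nn.Module):
    """Distills an input image into one compact embedding z."""
    def __init__(self, z_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, z_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class ConditionedDenoiser(nn.Module):
    """Toy denoiser whose only access to the source view is the embedding z."""
    def __init__(self, z_dim: int = 128, width: int = 64):
        super().__init__()
        self.inp = nn.Conv2d(3, width, 3, padding=1)
        self.film = nn.Linear(z_dim, 2 * width)  # scale and shift derived from z
        self.out = nn.Conv2d(width, 3, 3, padding=1)

    def forward(self, x_noisy, z, t):
        # A real denoiser would also embed the timestep t; omitted for brevity.
        h = F.relu(self.inp(x_noisy))
        scale, shift = self.film(z).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.out(h)  # predicted noise


def denoising_loss(encoder, denoiser, source_view, target_view, T: int = 1000):
    """Noise-prediction loss: z from the source view must carry everything the
    decoder needs to denoise a (possibly novel) target view."""
    z = encoder(source_view)
    t = torch.randint(0, T, (target_view.shape[0],), device=target_view.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2  # simple cosine-like schedule
    a = alpha_bar.view(-1, 1, 1, 1)
    noise = torch.randn_like(target_view)
    x_noisy = a.sqrt() * target_view + (1 - a).sqrt() * noise
    return F.mse_loss(denoiser(x_noisy, z, t), noise)


if __name__ == "__main__":
    enc, dec = BottleneckEncoder(), ConditionedDenoiser()
    imgs = torch.randn(4, 3, 64, 64)             # stand-in batch
    loss = denoising_loss(enc, dec, imgs, imgs)  # source == target in this toy case
    loss.backward()
    print(float(loss))
```

In this sketch the embedding z is the sole channel between encoder and decoder, so any information the decoder needs to reconstruct or re-synthesize the content must be packed into it, which is the mechanism the abstract credits for the learned representations.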

Authors

Drew A. Hudson, Daniel Zoran, Mateusz Malinowski, Andrew Lampinen, Drew Jaegle, Jay McClelland, Loic Matthey, Felix Hill, Alexander Lerchner

Venue

CVPR 2024