Abstract
We present an approach to infer semantically meaningful keypoints from input images using a pre-trained image diffusion model. Although such models are traditionally employed for image generation, we repurpose one to learn semantically consistent keypoints across diverse image datasets. Our method adds noise to the input image, simulates the denoising process, and extracts the resulting attention maps. From these, we select the k sharpest maps and sharpen them further to improve localization precision. We additionally introduce an equivariance constraint that keeps the learned keypoints consistent under image transformations. Our approach outperforms competing methods, excelling in particular on non-aligned data, demonstrating its ability to identify semantically meaningful keypoints across images.
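The "select the k sharpest attention maps, then sharpen" step can be sketched in isolation. The snippet below is a minimal illustration, not the paper's implementation: it assumes attention maps are non-negative arrays of shape (N, H, W), scores sharpness as negative spatial entropy, and stands in for the sharpening step with a low-temperature softmax over spatial locations. The function names and the temperature parameter are hypothetical choices for this sketch.

```python
import numpy as np

def sharpness(attn):
    # Negative spatial entropy: a more localized (peaked) map scores higher.
    p = attn / attn.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return -entropy

def select_and_sharpen(attn_maps, k, temperature=0.1):
    """Pick the k sharpest attention maps and re-sharpen them.

    attn_maps: (N, H, W) non-negative maps (e.g. from cross-attention).
    The low-temperature softmax here is a stand-in for the paper's
    sharpening step; each output map sums to 1 over spatial locations.
    """
    scores = np.array([sharpness(a) for a in attn_maps])
    top = np.argsort(scores)[-k:]              # indices of the k sharpest maps
    selected = attn_maps[top]
    flat = selected.reshape(k, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    soft = np.exp(flat)
    soft = soft / soft.sum(axis=1, keepdims=True)
    return soft.reshape(selected.shape)

# Toy example: 5 random maps, one of which has a single localized peak.
rng = np.random.default_rng(0)
maps = rng.uniform(0.0, 1.0, size=(5, 8, 8))
maps[2] = 1e-3
maps[2, 3, 4] = 1.0                            # a sharp peak at (3, 4)
sharp = select_and_sharpen(maps, k=2)
print(sharp.shape)
```

A keypoint location would then be read off each sharpened map as its spatial argmax; the peaked toy map above survives the top-k selection because its entropy is lowest.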
Authors
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Helge Rhodin, Andrea Tagliasacchi, Kwang Moo Yi
Venue
CVPR 2024