Multimodal input
Capable of answering questions about images or short videos with details and context.
Explore PaliGemma 2
PaliGemma 2 combines the SigLIP-So400m vision encoder with the Gemma 2 language model. It is available in 3B, 10B, and 28B parameter sizes and is trained at multiple resolutions, giving it broad knowledge that transfers well through fine-tuning.
Supports fine-tuning across various sizes and resolutions for tailored vision-language capabilities.
Comes with a checkpoint fine-tuned on a mixture of specialized tasks for immediate use.
General-purpose pre-trained models that can be fine-tuned on a variety of tasks.
Research-oriented models that are fine-tuned on specific research datasets.
Models tuned on a mixture of tasks, ready to use out of the box for common use cases.