PaliGemma 2

A family of lightweight, open vision-language models that interpret text and image inputs.

PaliGemma 2 combines the SigLIP-So400m vision encoder with the Gemma 2 language model. It is available in 3B, 10B, and 28B parameter sizes, each trained at three input resolutions (224, 448, and 896 pixels), to provide broad knowledge that transfers well through fine-tuning.
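To make the size and resolution grid concrete, here is a minimal sketch that enumerates the pre-trained checkpoints. The 224/448/896 pixel resolutions and the `google/paligemma2-<size>-pt-<resolution>` naming pattern are assumptions based on how the checkpoints are commonly listed on Hugging Face; `pretrained_checkpoint_ids` is a hypothetical helper, so verify the names against the model pages before relying on them.

```python
from itertools import product

# Parameter sizes named on this page.
SIZES = ("3b", "10b", "28b")
# Assumed square input resolutions in pixels (not stated on this page).
RESOLUTIONS = (224, 448, 896)


def pretrained_checkpoint_ids(org: str = "google") -> list[str]:
    """Return one assumed checkpoint ID per (size, resolution) pair."""
    return [
        f"{org}/paligemma2-{size}-pt-{res}"
        for size, res in product(SIZES, RESOLUTIONS)
    ]


if __name__ == "__main__":
    for checkpoint_id in pretrained_checkpoint_ids():
        print(checkpoint_id)
```

Smaller sizes and lower resolutions are cheaper to fine-tune and serve, while higher resolutions tend to help tasks that depend on fine visual detail, such as reading small text.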

Capabilities

Multimodal input

Answers questions about images or short videos with detail and context.

Versatile base models

Supports fine-tuning across various sizes and resolutions for tailored vision-language capabilities.

Off-the-shelf exploration

Comes with a checkpoint fine-tuned on a mixture of specialized tasks for immediate use.

Model variants

PaliGemma 2 PT

General purpose pre-trained models that can be fine-tuned on a variety of tasks.

PaliGemma 2 FT

Research-oriented models that are fine-tuned on specific research datasets.

PaliGemma 2 mix

Models tuned on a mixture of tasks that can be used out of the box for common use cases.
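As an illustration of out-of-the-box use, the sketch below runs a mix checkpoint through the Hugging Face `transformers` library. It assumes `torch`, `transformers`, and `Pillow` are installed, that the checkpoint ID `google/paligemma2-3b-mix-224` is correct, and that you have accepted the model license on Hugging Face. The helpers `build_prompt` and `generate_text` are hypothetical names introduced here, and the short task-prefix prompt style (`caption en`, `answer en <question>`) is an assumption about how the mix models expect to be prompted.

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Compose a task-prefix prompt such as 'caption en' or 'answer en <question>'."""
    parts = [task, lang]
    if text:
        parts.append(text)
    return " ".join(parts)


def generate_text(
    image_path: str,
    prompt: str,
    model_id: str = "google/paligemma2-3b-mix-224",  # assumed checkpoint ID
    max_new_tokens: int = 50,
) -> str:
    """Run one image and one prompt through a mix checkpoint and return the text."""
    # Heavy dependencies are imported here so the prompt helper works without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(torch.bfloat16).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens so only the newly generated text is decoded.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Calling `generate_text("photo.jpg", build_prompt("caption"))` would request an English caption for a local image, and `build_prompt("answer", "What is on the table?")` would form a question-answering prompt. Swapping `model_id` for a larger size or a higher-resolution variant should not require any other code changes.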

Download PaliGemma 2

Hugging Face

Kaggle