PaliGemma 2

A family of lightweight, open, vision-language models that can interpret text and image inputs.

PaliGemma 2 combines the SigLIP-So400m vision encoder with the Gemma 2 language model. It is available in 3B, 10B, and 28B parameter sizes and is pre-trained at multiple input resolutions, giving it broad knowledge that transfers well through fine-tuning.


Capabilities

Multimodal input

Capable of answering questions about images or short videos with details and context.

Versatile base models

Supports fine-tuning across various sizes and resolutions for tailored vision-language capabilities.

Off-the-shelf exploration

Comes with a checkpoint fine-tuned on a mixture of specialized tasks for immediate use.


Model variants

PT

PaliGemma PT

General-purpose pre-trained models that can be fine-tuned on a variety of tasks.

FT

PaliGemma FT

Research-oriented models that are fine-tuned on specific research datasets.

MIX

PaliGemma 2 mix

Models fine-tuned on a mixture of tasks that can be used out of the box for common use cases.
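
Because the variants come in several sizes and resolutions, it can help to enumerate the combinations before choosing one to fine-tune or evaluate. The sketch below is illustrative only: the sizes come from the overview above, while the resolution values (224, 448, 896 pixels), the `paligemma2-{size}-{variant}-{resolution}` naming pattern, and the helper names are assumptions, and not every combination is necessarily published.

```python
from itertools import product

# Sizes are taken from the overview; everything else here is assumed.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)  # assumed input resolutions, in pixels
VARIANTS = ("pt", "mix")       # pre-trained and mixture-tuned variants


def checkpoint_name(size: str, variant: str, resolution: int) -> str:
    """Build a hypothetical checkpoint identifier from its three parts."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return f"paligemma2-{size}-{variant}-{resolution}"


def all_checkpoint_names(variant: str) -> list[str]:
    """List every size/resolution combination for one variant."""
    return [
        checkpoint_name(size, variant, resolution)
        for size, resolution in product(SIZES, RESOLUTIONS)
    ]


if __name__ == "__main__":
    for name in all_checkpoint_names("pt"):
        print(name)
```

Check any generated name against the official model listing before use, since the pattern above is a guess rather than a documented scheme.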