Gemini Robotics 1.5
Our most capable vision-language-action (VLA) model, which turns visual information and instructions into motor commands so a robot can perform a task.
This agentic, Gemini-based multimodal model allows robots to take action in the physical world.
Capabilities
Gemini models can respond to text, images, audio, and video. Gemini Robotics adds the ability to reason about physical space, allowing robots to take action in the real world.
- Generality: Understands the physical world, and adapts and generalizes its behaviour to fit new situations. Breaks down goals into manageable steps to make longer-term plans and overcome unexpected problems.
- Interactivity: Understands and responds to everyday commands, and can explain its approach while taking action. Users can redirect it at any point without using technical language, and it adjusts to changes in its environment.
- Dexterity: Enables robots to tackle complex tasks requiring fine motor skills and precise manipulation, like folding origami, packing a lunch box, or preparing a salad.
- Thinking: Enables robots to think before acting, improving the quality of their actions and making their decisions more transparent in natural language (see the interface sketch after this list).
- Multiple embodiments: Adapts to a diverse array of robot forms, from bi-arm static platforms like ALOHA and the Bi-arm Franka to humanoid robots like Apptronik's Apollo. A single model can be used across all of these robots, in turn accelerating learning across embodiments.
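To make the pattern concrete, here is a minimal sketch of a VLA control loop in Python. Every name in it (VLAModel, StepResult, Action, send_to_robot) is a hypothetical illustration of the image-plus-instruction-in, thought-plus-action-out flow described above; it is not the Gemini Robotics 1.5 API, which is in private preview.

```python
# Hypothetical sketch of a vision-language-action (VLA) control loop.
from dataclasses import dataclass


@dataclass
class Action:
    """A low-level motor command for one control step (hypothetical)."""
    joint_deltas: list[float]


@dataclass
class StepResult:
    """The model 'thinks before acting': it returns a natural-language
    rationale alongside the motor command, making decisions transparent."""
    thought: str
    action: Action


class VLAModel:
    """Stand-in for a VLA policy: images + instruction in, actions out."""

    def step(self, camera_image: bytes, instruction: str) -> StepResult:
        # A real model would run inference here; this stub only shows the
        # interface shape: image/text input, text ("thought") + action output.
        return StepResult(
            thought=f"Planning next motion for: {instruction!r}",
            action=Action(joint_deltas=[0.0] * 7),
        )


def send_to_robot(action: Action) -> None:
    # Embodiment-specific actuation would live behind this boundary, which
    # is how one policy can drive ALOHA, Franka, or a humanoid alike.
    print(f"executing joint deltas: {action.joint_deltas}")


def control_loop(model: VLAModel, get_image, instruction: str, steps: int = 3):
    for _ in range(steps):
        result = model.step(get_image(), instruction)
        print(result.thought)        # interactivity: explains its approach
        send_to_robot(result.action)


if __name__ == "__main__":
    control_loop(VLAModel(), lambda: b"<jpeg bytes>", "pack the lunch box")
```

Returning the natural-language thought alongside each action is what makes the "Thinking" behaviour inspectable: an operator can read the rationale and redirect the robot mid-task.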
Benchmarks
Gemini Robotics 1.5 consistently outperforms our previous models across all four categories of generalization.
Model deployment status | Private preview
Supported input data types | Image, Text
Supported output data types | Text, Action
Supported input tokens | 32k
Knowledge cutoff | October 2024
Availability | Partners
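For readers who think in types, the table above can be restated as a few illustrative Python definitions. The names (RoboticsRequest, RoboticsResponse, check_budget) and the token-budget check are assumptions made for this sketch; the model's actual interface is not public.

```python
# Illustrative restatement of the spec table as typed request/response shapes.
from dataclasses import dataclass, field
from enum import Enum

MAX_INPUT_TOKENS = 32_000  # from the table: 32k supported input tokens


class OutputKind(Enum):
    TEXT = "text"      # natural-language plans or explanations
    ACTION = "action"  # motor commands for the robot


@dataclass
class RoboticsRequest:
    """Inputs per the table: Image and Text."""
    images: list[bytes] = field(default_factory=list)
    text: str = ""


@dataclass
class RoboticsResponse:
    """Outputs per the table: Text and Action."""
    kind: OutputKind
    payload: object


def check_budget(estimated_tokens: int) -> None:
    # A request whose combined image and text tokens exceed the limit
    # would need to be truncated or split before being sent.
    if estimated_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"input exceeds {MAX_INPUT_TOKENS} tokens")
```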