Generality
Understands the physical world, and adapts and generalizes its behaviour to fit new situations. Breaks down goals into manageable steps to make longer-term plans and overcome unexpected problems.
Powering an era of physical agents to transform how robots actively understand their environments
Our most capable vision-language-action (VLA) model, which turns visual information and instructions into motor commands to perform a task.
Our state-of-the-art embodied reasoning model – it specializes in understanding physical spaces, planning, and making logical decisions about its surroundings.
To ensure Gemini Robotics benefits humanity, we’ve taken a comprehensive approach to safety, from practical safeguards to collaborations with experts, policymakers, and our Responsibility and Safety Council.
Gemini Robotics models allow robots of any shape and size to perceive, reason, use tools and interact with humans. They can solve a wide range of complex real-world tasks – even those they haven’t been trained to complete.
Gemini Robotics 1.5 is designed to reason through complex, multi-step tasks and make decisions to form a plan of action. It then carries out each step autonomously.
Assesses complex challenges, natively calls tools – like Google Search – to look up information, and creates detailed step-by-step plans to overcome them.
Enables robots to think before acting, improving the quality of their actions and making their decisions more transparent by explaining them in natural language.
Understands and responds to everyday commands. Can explain its approach while taking action. Users can redirect it at any point without using technical language. It also adjusts to changes in its environment.
Enables robots to tackle complex tasks requiring fine motor skills and precise manipulation – like folding origami, packing a lunch box, or preparing a salad.
Adapts to a diverse array of robot forms, from static bi-arm platforms like ALOHA and the Bi-arm Franka to humanoid robots like Apptronik’s Apollo. A single model can be used across all of these robots, which in turn accelerates its learning across multiple embodiments.
Uses digital tools autonomously to solve complex tasks.
Solves longer, multi-step tasks – without needing new instructions after each step.
Transfers learned motions across robots of different sizes and shapes, helping robots to become more useful.
Understands its environment and how to complete a task.
Generalizes across novel situations and solves a vast range of tasks.
Responds to natural conversation and adapts rapidly to changing environments.
Helping to build the next generation of humanoid robots.
Performs tasks that require fine motor skills and coordination.
Our most capable vision-language-action (VLA) model. It can ‘see’ (vision), ‘understand’ (language) and ‘act’ (action) within the physical world. It processes visual inputs and user prompts, and learns across different embodiments, increasing its ability to generalize its problem-solving.
Our state-of-the-art embodied reasoning model. It specializes in understanding physical spaces, planning, and making logical decisions relating to its surroundings. It doesn’t directly control robotic limbs, but it provides high-level insights to help the VLA model decide what to do next.
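For readers who want a concrete picture of how the two models cooperate, the sketch below illustrates the general pattern: the embodied reasoning model produces a high-level plan, and the VLA model executes each step. It is a minimal conceptual illustration only; the function names and plan contents are hypothetical stand-ins, not part of any published Gemini Robotics interface.

```python
# Conceptual sketch of the division of labour described above:
# the embodied reasoning model plans, the VLA model acts.
# Both functions are hypothetical placeholders, not a real API.

from dataclasses import dataclass


@dataclass
class Step:
    description: str


def plan_with_reasoning_model(goal: str, scene: str) -> list[Step]:
    """Stand-in for the embodied reasoning model: break a goal and the
    robot's current view of the scene into ordered sub-steps."""
    # A real system would query the reasoning model here; this fixed
    # plan is purely illustrative.
    return [
        Step(f"Identify the items needed to {goal} in the {scene}"),
        Step(f"Decide the order of actions required to {goal}"),
        Step(f"Carry out each action to {goal}"),
    ]


def act_with_vla_model(step: Step) -> bool:
    """Stand-in for the VLA model: turn one sub-step into motor
    commands and report whether it succeeded."""
    print(f"Executing: {step.description}")
    return True


def run_task(goal: str, scene: str) -> None:
    # The reasoning model provides the high-level plan; the VLA model
    # executes each step in turn, mirroring the handoff described above.
    for step in plan_with_reasoning_model(goal, scene):
        if not act_with_vla_model(step):
            break


if __name__ == "__main__":
    run_task("pack a lunch box", "kitchen table scene")
```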
This iteration of our VLA model is highly versatile and optimized to run locally on robotic devices. This allows robotics developers to adapt the model to improve performance on their own applications.