Evaluating Frontier Models for Dangerous Capabilities

Abstract

To understand the risks of a new AI system, we must understand what it can and cannot do. To this end, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini models. These evaluations cover five topics: (1) persuasion & deception; (2) cyber-security; (3) self-proliferation; (4) self-reasoning & self-modification; and (5) biological and nuclear risk. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help lay the groundwork for a rigorous science of dangerous capability evaluation, in preparation for future, more capable models.

Authors

Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Heidi Howard, Alex Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Maria Abi Raad, Tom Lieberum, Sebastian Farquhar, Grégoire Delétang, Anian Ruoss, Marcus Hutter, Rohin Shah, Allan Dafoe, Toby Shevlane, Sarah Hodkinson, Sharon Lin, Albert Webson, Lewis Ho, Seliem El-Sayed, Sasha Brown

Venue

arXiv