
Scalable AI Safety via Doubly-Efficient Debate


Abstract

The emergence of powerful pre-trained AI systems with super-human capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety, as tasks can become too complicated for humans to judge directly. AI safety via debate [Irving et al., 2018] is a notable proposal for addressing this challenge, with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable sub-task. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address this limitation by designing a new set of debate protocols in which the honest strategy can always succeed using only a polynomial number of simulation steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
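To make the efficiency claim concrete, the following is a minimal Python sketch of the classical bisection idea behind debate over a long computation: a judge who can check only a single step resolves a dispute about a T-step computation, while the honest debater does only polynomial work. This is an illustration under toy assumptions (a deterministic counter machine; hypothetical names step, debate, honest_claim, dishonest_claim), not the paper's actual protocols, which additionally handle stochastic AI systems.

    # Illustrative sketch only; all names here are hypothetical.

    def step(state):
        # One step of the machine whose T-step output is under dispute
        # (a toy stand-in: increment a counter).
        return state + 1

    def debate(start, T, honest_claim, dishonest_claim):
        # Each debater asserts the machine's state after t steps via its
        # claim function. The debaters bisect the interval of disagreement
        # until a single step remains, which the judge executes directly.
        lo, hi = 0, T
        agreed = start  # both debaters agree on the state at step lo
        while hi - lo > 1:
            mid = (lo + hi) // 2
            h, d = honest_claim(mid), dishonest_claim(mid)
            if h == d:
                lo, agreed = mid, h  # disagreement lies in (mid, hi]
            else:
                hi = mid             # disagreement lies in (lo, mid]
        # The judge runs the one disputed step and sides with whichever
        # debater's claim at step hi matches the true result.
        return "honest" if honest_claim(hi) == step(agreed) else "dishonest"

    # The honest debater answers each query by simulating at most T steps,
    # so its total work is polynomial in T; extra computation by the
    # dishonest debater cannot change the judge's one-step check.
    T = 1_000_000
    honest = lambda t: t                # true state after t increments from 0
    dishonest = lambda t: t + (t == T)  # lies about the final state
    assert debate(0, T, honest, dishonest) == "honest"

In this toy picture the honest debater wins with polynomially many simulation steps and the judge checks one step; the paper's protocols extend this guarantee to stochastic AI systems and to dishonest strategies allowed exponentially many simulation steps.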

Authors

Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

Venue

arXiv