Abstract
Causal attribution of behaviour is a foundational problem in interpretability. Activation Patching is a method of directly computing these causal attributions, and is ubiquitous in mechanistic interpretability analyses. However, scaling it to many attributions requires a sweep whose cost scales linearly with the number of model components, which can be prohibitively expensive, involving millions to billions of forward passes in SoTA models. We propose to use Attribution Patching (AtP), a gradient-based approximation that runs in $O(1)$ passes, as a pre-filtering step to Activation Patching.
We investigate the performance of AtP, finding two classes of failure modes which produce false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to estimate the residual error of AtP* and bound the probability of remaining false negatives.
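To make the cost contrast concrete, the following is a minimal sketch on a toy problem, not the implementation studied in the paper. It assumes a hypothetical scalar metric `tanh(a @ w)` over a vector of "component activations" `a`, with a hand-written gradient; the names `metric`, `metric_grad`, `activation_patching`, and `attribution_patching` are illustrative only. Activation patching re-evaluates the metric once per component, whereas the gradient-based estimate obtains all components from a single gradient via a first-order Taylor expansion.

```python
import numpy as np

def metric(a, w):
    """Toy scalar metric as a function of component activations `a` (assumption)."""
    return np.tanh(a @ w)

def metric_grad(a, w):
    """Analytic gradient of the toy metric with respect to `a`."""
    return (1.0 - np.tanh(a @ w) ** 2) * w

def activation_patching(a_clean, a_corrupt, w):
    """Exact per-component effect: one metric evaluation per component."""
    base = metric(a_clean, w)
    effects = np.empty_like(a_clean)
    for i in range(a_clean.size):
        patched = a_clean.copy()
        patched[i] = a_corrupt[i]  # swap in a single corrupted activation
        effects[i] = metric(patched, w) - base
    return effects

def attribution_patching(a_clean, a_corrupt, w):
    """First-order estimate of every effect from one gradient at the clean point."""
    return (a_corrupt - a_clean) * metric_grad(a_clean, w)

# Small perturbation: the linear estimate closely tracks the exact sweep.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
a_clean = rng.normal(size=8)
a_corrupt = a_clean + 1e-3 * rng.normal(size=8)
exact = activation_patching(a_clean, a_corrupt, w)
approx = attribution_patching(a_clean, a_corrupt, w)

# Saturated clean point: the gradient is near zero, so the estimate is tiny
# even though the true patching effect is large (a false negative).
w_sat = np.array([1.0])
exact_sat = activation_patching(np.array([5.0]), np.array([-5.0]), w_sat)
approx_sat = attribution_patching(np.array([5.0]), np.array([-5.0]), w_sat)
```

The second case only illustrates how a linear approximation can miss a large effect when the gradient vanishes at the clean input; it is a generic property of first-order estimates in this toy setting and is not presented as a reproduction of the specific failure modes analysed in the paper.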
Authors
János Kramár, Tom Lieberum, Neel Nanda, Rohin Shah
Venue
arXiv