March 11, 2024

Understanding Learning from Human Preferences

Abstract

The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations. The first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimization (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, DPO still relies heavily on the first approximation.

In this paper, we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new generalized objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO called Identity Preference Optimization (IPO), for which we can derive an efficient optimization procedure, prove performance guarantees, and demonstrate its empirical performance on a synthetic example. Our in-depth analysis shows how DPO and RLHF can be prone to over-training despite being regularized, while IPO remains robust to over-training.
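To make the contrast between DPO and IPO concrete, here is a minimal sketch of the two per-example losses. It is not taken from this abstract: the loss forms are the commonly cited ones (a logistic loss on the policy-versus-reference log-ratio margin for DPO, a squared loss pulling that margin toward $1/(2\tau)$ for IPO), and every name below (`log_ratio_margin`, `dpo_loss`, `ipo_loss`, `beta`, `tau`) is a hypothetical choice made for illustration.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Margin h = log[pi(y_w) * pi_ref(y_l) / (pi(y_l) * pi_ref(y_w))].

    Inputs are log-probabilities of the preferred (w) and dispreferred (l)
    completions under the policy and under the reference policy.
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta):
    """Assumed DPO form: -log(sigmoid(beta * h)), computed stably."""
    z = beta * h
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def ipo_loss(h, tau):
    """Assumed IPO form: squared distance of the margin from 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    tau = 0.1
    # Sweep the margin: the DPO loss keeps shrinking as the margin grows,
    # while the IPO loss is smallest at the finite target 1 / (2 * tau).
    for h in (0.0, 2.5, 5.0, 10.0, 20.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h, tau):.4f}  ipo={ipo_loss(h, tau):.2f}")
```

Under these assumed forms, the logistic loss always rewards a larger margin, so the policy can keep drifting from the reference on a finite dataset, which is one way to read the over-training claim. The squared loss instead has a finite minimizer controlled by `tau`.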

Authors

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Zhaohan Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

Venue

(deprecated) AISTATS2024
