Abstract
The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards; the second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimization (DPO) was proposed as an approach that bypasses the second approximation and learns a policy directly from collected data, without the reward modelling stage. However, DPO still heavily relies on the first approximation.
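As a concrete illustration of the first approximation, standard RLHF pipelines typically posit a Bradley-Terry-style model in which the probability of preferring one response over another is expressed through pointwise rewards. The sketch below is our own rendering of this common assumption rather than a formula from the abstract; $r$ denotes the pointwise reward and $\sigma$ the logistic function.

$$p(y \succ y' \mid x) \;\approx\; \sigma\bigl(r(x, y) - r(x, y')\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$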
In this paper, we aim to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO, called Identity Preference Optimization (IPO), for which we can derive an efficient optimization procedure, prove performance guarantees, and demonstrate its empirical performance on a synthetic example. Our in-depth analysis shows how DPO and RLHF can be prone to over-training despite being regularized, while IPO remains robust to over-training.
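To give the shape of the objective described above, $\Psi$PO can be sketched as preference-based policy optimization with a KL regularizer towards a reference policy $\pi_{\mathrm{ref}}$. The notation below ($\mu$ for the policy generating comparison completions, $\tau$ for the regularization strength, $p^*$ for the true preference probability) is our reading of the paper's setup rather than notation taken from the abstract itself:

$$\max_{\pi}\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Bigl[\Psi\bigl(p^*(y \succ y' \mid x)\bigr)\Bigr] \;-\; \tau\, \mathrm{KL}\bigl(\pi \,\|\, \pi_{\mathrm{ref}}\bigr).$$

Under this sketch, taking $\Psi$ to be the logit function $\Psi(q) = \log\frac{q}{1-q}$ corresponds to the RLHF/DPO setting, while taking $\Psi$ to be the identity yields IPO.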
Authors
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Zhaohan Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
Venue
AISTATS 2024