Abstract
In reinforcement learning (RL), an agent acts so as to maximize its return under uncertainty. It is natural to apply Bayesian probabilistic inference to the uncertain parameters and, since the goal of the agent is to find the optimal policy, a relevant object of study is the posterior probability of optimality for each state-action pair. Previous work on 'RL as inference' has equipped the agent with a surrogate potential in order to estimate this quantity; however, the approximation can be arbitrarily poor, leading to algorithms that do not perform well in practice. In this work, we rigorously analyze how the posterior probability of optimality flows through the Markov decision process (MDP) and show that sampling according to this probability yields a guaranteed Bayesian regret bound. In practice, computing this probability is intractable, so we derive a variational Bayesian approximation that yields a tractable convex optimization problem, and we show that the resulting policy also satisfies a Bayesian regret bound. We call our approach VAPOR and show that it has deep connections to Thompson sampling, K-learning, information theory, and maximum entropy exploration.
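To make the central object concrete, the following is a minimal toy sketch (not the paper's VAPOR algorithm) of "sampling according to the posterior probability of optimality" in a Bernoulli bandit: the probability that each arm is optimal is estimated by Monte Carlo over Beta posteriors, and an action is drawn in proportion to it. This probability-matching rule is the idea the paper generalizes to full MDPs; all names and parameters below are illustrative assumptions.

```python
# Toy illustration only (not the paper's method): probability matching in a
# Bernoulli bandit, i.e. sampling an arm according to its posterior
# probability of being optimal under independent Beta posteriors.
import numpy as np

rng = np.random.default_rng(0)

def prob_optimal(successes, failures, n_samples=10_000):
    """Monte Carlo estimate of P(arm a is optimal | data)."""
    # Posterior draws for every arm: shape (n_samples, n_arms).
    draws = rng.beta(successes + 1, failures + 1,
                     size=(n_samples, len(successes)))
    best = draws.argmax(axis=1)  # index of the best-looking arm in each sample
    return np.bincount(best, minlength=len(successes)) / n_samples

# Hypothetical observed counts for two arms.
successes = np.array([8, 5])
failures = np.array([2, 5])
p_opt = prob_optimal(successes, failures)
action = rng.choice(len(p_opt), p=p_opt)  # sample action ~ posterior prob. of optimality
print(p_opt, action)
```

In an MDP this probability must account for how optimality propagates through future states, which is what makes exact computation intractable and motivates the variational approximation described in the abstract.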
Authors
Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue
Venue
NeurIPS 2023