Abstract
In this paper, we propose Randomized Q-learning (RandQL), a new randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free algorithm based on posterior sampling. We analyze its performance in both tabular and non-tabular settings. In tabular MDPs, RandQL achieves a regret bound of order Õ(sqrt(H^5 S A T)), where H is the planning horizon, S is the number of states, A is the number of actions, and T is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order Õ(H^(5/2) T^((d_z+1)/(d_z+2))), where d_z denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of randomizing the learning rate. Our empirical study shows that RandQL outperforms existing approaches on standard exploration environments.
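To illustrate the learning-rate-randomization idea mentioned above, the sketch below shows a tabular Q-learning update whose step size is drawn from a Beta distribution rather than being a fixed deterministic schedule. This is only a minimal illustration, not the authors' exact RandQL pseudocode: the environment interface, the Beta parameterization, and the function names are assumptions made for the example.

```python
# Illustrative sketch only: a Q-learning step with a randomized learning rate.
# NOT the exact RandQL algorithm; Beta parameters and interfaces are assumed.
import numpy as np

def randomized_q_update(Q, counts, h, s, a, r, s_next, H, rng):
    """One tabular Q-learning update with a Beta-distributed learning rate.

    Q      : array of shape (H + 1, S, A), optimistically initialized, Q[H] == 0
    counts : visit counts n(h, s, a) used to shape the step-size distribution
    H      : planning horizon
    """
    counts[h, s, a] += 1
    n = counts[h, s, a]
    # Randomized step size with mean roughly (H + 1) / (H + 1 + n), mimicking the
    # usual optimistic Q-learning schedule; the extra randomness plays the role
    # of an exploration bonus (assumed parameterization, for illustration only).
    lr = rng.beta(H + 1, n)
    target = r + Q[h + 1, s_next].max()
    Q[h, s, a] = (1.0 - lr) * Q[h, s, a] + lr * target
    return Q

# Example call under the assumed toy interface:
# rng = np.random.default_rng(0)
# Q = np.zeros((H + 1, S, A)); Q[:H] = H  # optimistic initialization
# counts = np.zeros((H, S, A), dtype=int)
# Q = randomized_q_update(Q, counts, h, s, a, r, s_next, H, rng)
```

Acting greedily with respect to Q-values updated with such random step sizes yields optimism on average, which is the intuition behind replacing explicit exploration bonuses in the abstract above.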
Authors
Daniil Tiapkin*, Denis Belomestny*, Daniele Calandriello, Éric Moulines*, Rémi Munos, Alexey Naumov*, Pierre Perrault*, Michal Valko, Pierre Ménard*
Venue
NeurIPS 2023