Active offline policy selection

Reinforcement learning (RL) has made tremendous progress in recent years towards addressing real-life problems – and offline RL made it even more practical. Instead of direct interactions with the environment, we can now train many algorithms from a single pre-recorded dataset. However, we lose the practical advantages in data-efficiency of offline RL when we evaluate the policies at hand.

For example, when training robotic manipulators the robot resources are usually limited, and training many policies by offline RL on a single dataset gives us a large data-efficiency advantage compared to online RL. Evaluating each policy is an expensive process, which requires interacting with the robot thousands of times. When we choose the best algorithm, hyperparameters, and a number of training steps, the problem quickly becomes intractable.

To make RL more applicable to real-world applications like robotics, we propose using an intelligent evaluation procedure to select the policy for deployment, called active offline policy selection (A-OPS). In A-OPS, we make use of the prerecorded dataset and allow limited interactions with the real environment to boost the selection quality.

Active offline policy selection (A-OPS) selects the best policy out of a set of policies given a pre-recorded dataset and limited interaction with the environment.

To minimise interactions with the real environment, we implement three key features:

Off-policy policy evaluation, such as fitted Q-evaluation (FQE), allows us to make an initial guess about the performance of each policy based on an offline dataset. It correlates well with the ground truth performance in many environments, including real-world robotics where it is applied for the first time.

FQE scores are well aligned with the ground truth performance of policies trained in both sim2real and offline RL setups.

The returns of the policies are modelled jointly using a Gaussian process, where observations include FQE scores and a small number of newly collected episodic returns from the robot. After evaluating one policy, we gain knowledge about all policies because their distributions are correlated through the kernel between pairs of policies. The kernel assumes that if policies take similar actions – such as moving the robotic gripper in a similar direction – they tend to have similar returns.

We useOPE scores and episodic returns to model latent policy performance as a Gaussian process.

Similarity between the policies is modelled through the distance between the actions these policies produce.

To be more data-efficient, we apply Bayesian optimisation and prioritise more promising policies to be evaluated next, namely those that have high predicted performance and large variance.

We demonstrated this procedure in a number of environments in several domains: dm-control, Atari, simulated, and real robotics. Using A-OPS reduces the regret rapidly, and with a moderate number of policy evaluations, we identify the best policy.

In a real-world robotic experiment, A-OPS helps identify a very good policy faster than other baselines. To find a policy with close to zero regret out of 20 policies takes the same amount of time as it takes to evaluate two policies with current procedures.

Our results suggest that it’s possible to make an effective offline policy selection with only a small number of environment interactions by utilising the offline data, special kernel, and Bayesian optimisation. The code for A-OPS is open-sourced and available on GitHub with an example dataset to try.

Explore our next generation AI systems

Our latest AI breakthroughs and updates from the lab

Unlocking a new era of discovery with AI

Our mission is to build AI responsibly to benefit humanity

Active offline policy selection