Jump to Content


Fast reinforcement learning through the composition of behaviours


André Barreto, Shaobo Hou, Diana Borsa, David Silver, Doina Precup

A cropped close-up of a pile of coffee beans.

The compositional nature of intelligence

Imagine if you had to learn how to chop, peel and stir all over again every time you wanted to learn a new recipe. In many machine learning systems, agents often have to learn entirely from scratch when faced with new challenges. It’s clear, however, that people learn more efficiently than this: they can combine abilities previously learned. In the same way that a finite dictionary of words can be reassembled into sentences of near infinite meanings, people repurpose and re-combine skills they already possess in order to tackle novel challenges.

In nature, learning arises as an animal explores and interacts with its environment in order to gather food and other rewards. This is the paradigm captured by reinforcement learning (RL): interactions with the environment reinforce or inhibit particular patterns of behavior depending on the resulting reward (or penalty). Recently, the combination of RL with deep learning has led to impressive results, such as agents that can learn how to play boardgames like Go and chess, the full spectrum of Atari games, as well as more modern, difficult video games like Dota and StarCraft II.

A major limitation in RL is that current methods require vast amounts of training experience. For example, in order to learn how to play a single Atari game, an RL agent typically consumes an amount of data corresponding to several weeks of uninterrupted playing. A study led by researchers at MIT and Harvard indicated that in some cases, humans are able to reach the same performance level in just fifteen minutes of play.

One possible reason for this discrepancy is that, unlike humans, RL agents usually learn a new task from scratch. We would like our agents to leverage knowledge acquired in previous tasks to learn a new task more quickly, in the same way that a cook will have an easier time learning a new recipe than someone who has never prepared a dish before. In an article recently published in the Proceedings of the National Academy of Sciences (PNAS), we describe a framework aimed at endowing our RL agents with this ability.

Two ways of representing the world

To illustrate our approach, we will explore an example of an activity that is (or at least used to be) an everyday routine: the commute to work. Imagine the following scenario: an agent must commute every day from its home to its office, and it always gets a coffee on the way. There are two cafes between the agent's house and the office: one has great coffee but is on a longer path, and the other one has decent coffee but a shorter commute (Figure 1). Depending on how much the agent values the quality of the coffee versus how much of a rush it is in on a given day, it may choose one of two routes (the yellow and blue paths on the map shown in Figure 1).

A simple illustrated map shows three routes from home to the office, complete with buildings, trees, and a skyline. The blue route goes via cafe A, which has three stars for coffee, and five stars for food. The yellow route is slightly longer and goes via cafe B, with five stars for coffee and three stars for food. The there's a longer red route which goes through a park and past both cafes.

Figure 1: A map of an illustrative work commute.

Traditionally, RL algorithms fall into two broad categories: model-based and model-free agents (Figures 2 & 3). A model-based agent (Figure 2) builds a representation of many aspects of the environment. An agent of this type might know how the different locations are connected, the quality of the coffee in each cafe, and anything else that is considered relevant. A model-free agent (Figure 3) has a much more compact representation of its environment. For instance, a value-based model-free agent would have a single number associated with each possible route leaving its home; this is the expected "value" of each route, reflecting a specific weighing of coffee quality vs. commute length. Take the blue path shown in Figure 1 as an example. Say this path has length 4, and the coffee the agent gets by following it is rated 3 stars. If the agent cares about the commute distance 50% more than it cares about the quality of the coffee, the value of this path will be (-1.5 x 4) + (1 x 3) = -3 (we use a negative weight associated with the distance to indicate that longer commutes are undesirable).

A simplified version of the commute map, with only the key data: the routes and their distance, the stopping points, and their ratings.

Figure 2: How a model-based agent represents the world. Only details relevant to the agent are captured in the representation (compare with Figure 1). Still, the representation is considerably more complex than the one used by a model-free agent (compare with Figure 3).

Figure 3: How a value-based model-free agent represents the world. For each location the agent has one number associated with each possible course of action; this number is the “value” of each alternative available to the agent. When in a given location, the agent checks the values available and makes a decision based on this information only (as illustrated in the right figure for the location “home”). In contrast with the model-based representation, the information is stored in a non-spatial way, that is, there are no connections between locations (compare with Figure 2).

We can interpret the relative weighting of the coffee quality versus the commute distance as the agent’s preferences. For any fixed set of preferences, a model-free and a model-based agent would choose the same route. Why then have a more complicated representation of the world, like the one used by a model-based agent, if the end result is the same? Why learn so much about the environment if the agent ends up sipping the same coffee?

Preferences can change day to day: an agent might take into account how hungry it is, or whether it’s running late to a meeting, in planning its route to the office. One way for a model-free agent to handle this is to learn the best route associated with every possible set of preferences. This is not ideal because learning every possible combination of preferences will take a long time. It is also impossible to learn a route associated with every possible set of preferences if there are infinitely many of them.

In contrast, a model-based agent can adapt to any set of preferences, without any learning, by "imagining" all possible routes and asking how well they would fulfill its current mindset. However, this approach also has drawbacks. Firstly, ”mentally” generating and evaluating all possible trajectories can be computationally demanding. Secondly, building a model of the entire world can be very difficult in complex environments.

Model-free agents learn faster but are brittle to change. Model-based agents are flexible but can be slow to learn. Is there an intermediate solution?

Successor features: a middle ground

A recent study in behavioural science and neuroscience suggests that in certain situations, humans and animals make decisions based on an algorithmic model that is a compromise between the model-free and model-based approaches (here and here). The hypothesis is that, like model-free agents, humans also compute the value of alternative strategies in the form of a number. But, instead of summarising a single quantity, humans summarise many different quantities describing the world around them, reminiscent of model-based agents.

It’s possible to endow an RL agent with the same ability. In our example, such an agent would have, for each route, a number representing the expected quality of coffee and a number representing the distance to the office. It could also have numbers associated with things the agent is not deliberately trying to optimise but are nevertheless available to it for future reference (for example, the quality of the food in each cafe). The aspects of the world the agent cares about and keeps track of are sometimes referred to as “features”. Because of that, this representation of the world is called successor features (previously termed the “successor representation” in its original incarnation).

Successor features can be thought of as a middle ground between the model-free and model-based representations. Like the latter, successor features summarise many different quantities, capturing the world beyond a single value. However, like in the model-free representation, the quantities the agent keeps track of are simple statistics summarising the features it cares about. In this way, successor features are like an “unpacked” version of the model-free agent. Figure 4 illustrates how an agent using successor features would see our example environment.

Figure 4: Representing the world using successor features. This is similar to how a model-free agent represents the world, but, instead of one number associated with each path, the agent has several numbers (in this case, coffee, food and distance). That is, at the location “home”, the agent would have nine, rather than three, numbers to weigh according to its preferences at the moment (compare with Figure 3).

Using successor features: composing novel plans from a dictionary of policies

Successor features are a useful representation because they allow for a route to be evaluated under different sets of preferences. Let’s use the blue route in Figure 1 as an example again. Using successor features, the agent would have three numbers associated with this path: its length (4), the quality of the coffee (3) and the quality of the food (5). If the agent already ate breakfast it will probably not care much about the food; also, if it is late, it might care about the commute distance more than the quality of the coffee --say, 50% more, as before. In this scenario the value of the blue path would be (-1.5 x 4) + (1 x 3) + (0 x 5) = -3, as in the example given above. But now, on a day when the agent is hungry, and thus cares about the food as much as it cares about the coffee, it can immediately update the value of this route to (-1.5 x 4) + (1 x 3) + (1 x 5) = 2. Using the same strategy, the agent can evaluate any route according to any set of preferences.

In our example, the agent is choosing between routes. More generally, the agent will be searching for a policy: a prescription of what to do in every possible situation. Policies and routes are closely related: in our example, a policy that chooses to take the road to cafe A from home and then chooses the road to the office from cafe A would traverse the blue path. So, in this case, we can talk about policies and routes interchangeably (this would not be true if there were some randomness in the environment, but we will leave this detail aside). We discussed how successor features allow a route (or policy) to be evaluated under different sets of preferences. We call this process generalised policy evaluation, or GPE.

Why is GPE useful? Suppose the agent has a dictionary of policies (for example, known routes to the office). Given a set of preferences, the agent can use GPE to immediately evaluate how well each policy in the dictionary would perform under those preferences. Now the really interesting part: based on this quick evaluation of known policies, the agent can create entirely new policies on the fly. The way it does it is simple: every time the agent has to make a decision, it asks the following question: “if I were to make this decision and then follow the policy with the maximum value thereafter, which decision would lead to the maximum overall value?” Surprisingly, if the agent picks the decision leading to the maximum overall value in each situation, it ends up with a policy that is often better than the individual policies used to create it.

This process of “stitching together” a set of policies to create a better policy is called generalised policy improvement, or GPI. Figure 5 illustrates how GPI works using our running example.

Three diagrams show the agent's preferred route from home to the office. Like in the previous examples, it has three options of route passing different combinations of park, cafe A and cafe B.

Figure 5: How GPI works. In this example the agent cares about the commute distance 50% more than it cares about coffee and food quality. The best thing to do in this case is to visit cafe A, then visit cafe B, and finally go to the office. The agent knows three policies associated with the blue, yellow, and orange paths (see Figure 1). Each policy traverses a different path, but none of them coincides with the desired route. Using GPE, the agent evaluates the three policies according to its current set of preferences (that is, weights -1.5, 1, and 1 associated with distance, coffee and food, respectively). Based on this evaluation, the agent asks the following question at home: “if I were to follow one of the three policies all the way to the office, which one would be best?” Since the answer to this question is the blue policy, the agent follows it. However, instead of commiting to the blue policy all the way, when the agent arrives at cafe A it asks the same question again. Now, instead of the blue path, the agent follows the orange one. By repeating this process the agent ends up following the best path to the office to satisfy its preferences, even though none of its known policies would do so on their own.

The performance of a policy created through GPI will depend on how many policies the agent knows. For instance, in our running example, as long as the agent knows the blue and yellow paths, it will find the best route for any preferences over coffee quality and commute length. But the GPI policy will not always find the best route. In Figure 1, the agent would never visit cafe A and then cafe B if it did not already know a policy that connected them in this way (like the orange route in the figure).

A simple example to show GPE and GPI in action

To illustrate the benefits of GPE and GPI, we now give a glimpse of one of the experiments from our recent publication (see paper for full details). The experiment uses a simple environment that represents in an abstract way the type of problem in which our approach can be useful. As shown in Figure 6, the environment is a 10 x 10 grid with 10 objects spread across it. The agent only gets a non-zero reward if it picks up an object, in which case another object pops up in a random location. The reward associated with an object depends on its type. Object types are meant to represent concrete or abstract concepts; to connect with our running example, we will consider that each object is either “coffee” or “food” (these are the features the agent keeps track of).

A ten by ten square grid shows the three different routes in red, blue and grey. Each route has icons in the squares representing coffee or food.

Figure 6: Simple environment to illustrate the usefulness of GPE and GPI. The agent moves around using the four directional actions (“up”, “down”, “left” and “right”) and receives a non-zero reward when it picks up an object. The reward associated with an object is defined by its type (“coffee” or “food”).

Clearly, the best strategy for the agent depends on its current preferences over coffee or food. For example, in Figure 6, an agent that only cares about coffee may follow the path in red, while an agent focused exclusively on food would follow the blue path. We can also imagine intermediate situations in which the agent wants coffee and food with different weights, including the case in which the agent wants to avoid one of them. For example, if the agent wants coffee but really does not want food, the gray path in Figure 6 may be a better alternative to the red one.

The challenge in this problem is to quickly adapt to a new set of preferences (or a “task”). In our experiments we showed how one can do so using GPE and GPI. Our agent learned two policies: one that seeks coffee and one that seeks food. We then tested how well the policy computed by GPE and GPI performed on tasks associated with different preferences. In figure 7 we compare our method with a model-free agent on the task whose goal is to seek coffee while avoiding food. Observe how the agent using GPE and GPI instantaneously synthesises a reasonable policy, even though it never learned how to deliberately avoid objects. Of course, the policy computed by GPE and GPI can be used as an initial solution to be later refined through learning, which means that it would match the final performance of a model-free agent but would probably get there faster.

A line graph with the y axis labelled "average sum of rewards". Towards the top, a horizontal blue line represents the GPE-GPI agent. A red line representing model-free agent (Q-learning) starts at the bottom of the graph, but then curves upwards to sit above the blue line.

Figure 7: A GPE-GPI agent learns to perform well given much less training data than a model-free method (Q-learning). Here the task is to seek coffee while avoiding food. The GPE-GPI agent learned two policies, one that seeks coffee and one that seeks food. It manages to avoid food even though it has never been trained to avoid an object. Shadowed regions are one standard deviation over 100 runs.

Figure 7 shows the performance of GPE and GPI on one specific task. We have also tested the same agent across many other tasks. Figure 8 shows what happens with the performance of the model-free and GPE-GPI agents when we change the relative importance of coffee and food. Note that, while the model-free agent has to learn each task separately, from scratch, the GPE-GPI agent only learns two policies and then quickly adapts to all of the tasks.

A bar graph showing average sum of rewards for the performance of the GPE-GPI agent over different tasks. A dotted line across the top of the chart represents the model-free agent (Q learning) solving each task individually. The bars in the middle third are highest, almost in line with the Q learning. The bars towards the edges drop off to almost zero.

Figure 8: Performance of the GPE-GPI agent over different tasks. Each bar corresponds to a task induced by a set of preferences over coffee and food. The colour gradients under the graph represent the sets of preferences: blue indicates positive weight, white indicates zero weight, and red indicates negative weight. So, for example, at the extremes of the graph we have tasks in which the goal is essentially to avoid one type of object while ignoring the other, while at the centre the task is to seek both types of objects with equal impetus. Error bars are one standard deviation over 10 runs.

The experiments above used a simple environment designed to exhibit the properties needed by GPE and GPI without unnecessary confounding factors. But GPE and GPI have also been applied at scale. For example, in previous papers (here and here) we showed how the same strategy also works when we replace a grid world with a three dimensional environment in which the agent receives observations from a first-person perspective (see illustrative videos here and here). We have also used GPE and GPI to allow a four-legged simulated robot to navigate along any direction after having learned how to do so along three directions only (see paper here and video here).

GPE and GPI in context

The work on GPE and GPI is at the intersection of two separate branches of research related to these operations individually. The first, related to GPE, is the work on the successor representation, initiated with Dayan’s seminal paper from 1993. Dayan’s paper inaugurated a line of work in neuroscience that is very active to this day (see further reading: "The successor representation in neuroscience"). Recently, the successor representation reemerged in the context of RL (links here and here), where it is also referred to as “successor features”, and became an active line of research there as well (see further reading: "GPE, successor features, and related approaches"). Successor features are also closely related to general value functions, a concept based on Sutton et al.’s hypothesis that relevant knowledge can be expressed in the form of many predictions about the world (also discussed here). The definition of successor features has independently emerged in other contexts within RL, and is also related to more recent approaches normally associated with deep RL.

The second branch of research at the origins of GPE and GPI, related to the latter, is concerned with composing behaviours to create new behaviours. The idea of a decentralised controller that executes sub-controllers has come up multiple times over the years (e.g., Brooks, 1986), and its implementation using value functions can be traced back to at least as far as 1997, with Humphrys’ and Karlsson’s PhD theses. GPI is also closely related to hierarchical RL, whose foundations were laid down in the 1990's and early 2000’s in the works by Dayan and Hinton, Parr and Russell, Sutton, Precup and Singh, and Dietterich. Both the composition of behaviours and hierarchical RL are today dynamic areas of research (see further reading: "GPI, hierarchical RL, and related approaches").

Mehta et al. were probably the first ones to jointly use GPE and GPI, although in the scenario they considered GPI reduces to a single choice at the outset (that is, there is no “stitching” of policies). The version of GPE and GPI discussed in this blog post was first proposed in 2016 as a mechanism to promote transfer learning. Transfer in RL dates back to Singh’s work in 1992 and has recently experienced a resurgence in the context of deep RL, where it continues to be an active area of research (see further reading: "GPE + GPI, transfer learning, and related approaches").

See more information about these works below, where we also provide a list of suggestions for further readings.

A compositional approach to reinforcement learning

In summary, a model-free agent cannot easily adapt to new situations, for example to accommodate sets of preferences it has not experienced before. A model-based agent can adapt to any new situation, but in order to do so it first has to learn a model of the entire world. An agent based on GPE and GPI offers an intermediate solution: although the model of the world it learns is considerably smaller than that of a model-based agent, it can quickly adapt to certain situations, often with good performance.

We discussed specific instantiations of GPE and GPI, but these are in fact more general concepts. At an abstract level, an agent using GPE and GPI proceeds in two steps. First, when faced with a new task, it asks: “How well would solutions to known tasks perform on this new task?” This is GPE. Then, based on this evaluation, the agent combines the previous solutions to construct a solution for the new task --that is, it performs GPI. The specific mechanics behind GPE and GPI are less important than the principle itself, and finding alternative ways to carry out these operations may be an exciting research direction. Interestingly, a new study in behavioural sciences provides preliminary evidence that humans make decisions in multitask scenarios following a principle that closely resembles GPE and GPI.

The fast adaptation provided by GPE and GPI is promising for building faster learning RL agents. More generally, it suggests a new approach to learning flexible solutions to problems. Instead of tackling a problem as a single, monolithic, task, an agent can break it down into smaller, more manageable, sub-tasks. The solutions of the sub-tasks can then be reused and recombined to solve the overall task faster. This results in a compositional approach to RL that may lead to more scalable agents. At the very least, these agents will not be late because of a cup of coffee.