An Introduction to Semi-supervised Reinforcement Learning

A great overview of semi-supervised reinforcement learning, including general discussion and implementation information.

By Paul Christiano, UC Berkeley.

Semi-supervised RL is similar to traditional episodic RL, but there are two kinds of episodes:

  • “labelled” episodes, which are just like traditional episodes,
  • “unlabelled” episodes, where the agent does not get to see its rewards.

As usual, our goal is to quickly learn a policy which receives a high reward per episode. There are two natural flavors of semi-supervised RL:

  • Random labels: each episode is labelled with some fixed probability.
  • Active learning: the agent can request feedback on its performance in any episode. The goal is to be economical with both feedback requests and total training time.

We can apply a traditional RL algorithm to the semi-supervised setting by simply ignoring all of the unlabelled episodes. This will generally result in very slow learning. The interesting challenge is to learn efficiently from the unlabelled episodes.
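To make the setting concrete, here is a minimal sketch of the random-labels flavor and of the naive baseline that throws unlabelled episodes away. The names Episode, label_at_random and labelled_only are hypothetical, chosen only for illustration, and an episode is reduced to a list of observations plus a single scalar reward.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One episode; `reward` is None when the episode is unlabelled."""
    observations: List[float]
    reward: Optional[float]


def label_at_random(true_rewards, observations, label_prob, rng):
    """Random-labels flavor: reveal each reward with probability `label_prob`."""
    episodes = []
    for obs, reward in zip(observations, true_rewards):
        visible = rng.random() < label_prob
        episodes.append(Episode(obs, reward if visible else None))
    return episodes


def labelled_only(episodes):
    """Naive baseline: keep only the episodes whose reward is visible."""
    return [ep for ep in episodes if ep.reward is not None]


# Usage: with a 10% label rate, the baseline sees only a small fraction
# of the experience the agent actually collected.
rng = random.Random(0)
episodes = label_at_random([1.0] * 1000, [[0.0]] * 1000, 0.1, rng)
usable = labelled_only(episodes)
print(len(usable), "of", len(episodes), "episodes carry a reward")
```

A traditional RL algorithm fed only the output of labelled_only discards most of its experience, which is why the baseline learns slowly.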

I think that semi-supervised RL is a valuable ingredient for AI control, as well as an interesting research problem in reinforcement learning.


Applications and motivation

Application to AI control: expensive reward functions

As a simple example, consider an RL system which learns from the user pressing a “reward button”: each time the agent performs a task well, the user presses the button to let it know. (A realistic design would more likely use verbal approval, more subtle cues, or performance measures that don’t involve the user at all. But a very simple example makes the point clear.)

If our system is a competent RL agent maximizing button presses, it will eventually learn to deceive and manipulate the user into pressing the button, or to simply press the button itself.

We’d prefer that the system treat the button presses as information about what behavior is good, rather than as the definition of what is good. Deceiving the user simply destroys the usefulness of that information.

This can be captured in the semi-supervised RL framework. Suppose that we have some expensive “ground truth” procedure that can reliably assess how good a system’s behavior really was. We can use this procedure to define the reward signal in a semi-supervised RL problem. The agent can then use the reward button presses to learn effectively from the “unlabelled” episodes, after recognizing that button presses provide useful information about the ground truth.
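One very simple way to picture this is to fit a model of the ground truth as a function of button presses on the labelled episodes, then use it to fill in rewards for the unlabelled ones. The sketch below assumes each episode is summarized by a single number (the count of button presses) and that a straight line is an adequate model. Both are illustrative assumptions rather than part of the proposal, and fit_line and impute_rewards are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All inputs identical: fall back to predicting the mean reward.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def impute_rewards(presses, rewards):
    """Replace missing rewards (None) with predictions from button presses.

    The model is fitted only on labelled episodes, i.e. those where the
    expensive ground truth was actually evaluated.
    """
    labelled = [(p, r) for p, r in zip(presses, rewards) if r is not None]
    if not labelled:
        raise ValueError("need at least one labelled episode")
    slope, intercept = fit_line([p for p, _ in labelled],
                                [r for _, r in labelled])
    return [r if r is not None else slope * p + intercept
            for p, r in zip(presses, rewards)]


# Usage: the third episode is unlabelled and gets a predicted reward.
filled = impute_rewards([0, 1, 2, 3], [0.0, 2.0, None, 6.0])
```

Because the fit is recomputed from ground-truth labels, button presses obtained by manipulation would stop predicting the ground truth on later labelled episodes, and the fitted relationship would weaken accordingly. How quickly that happens depends on the label rate.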

Of course, designing such a ground truth is itself a serious challenge. But designing an expensive objective seems much easier than designing a cheap one, and handling expensive objectives seems key to building efficient aligned AI systems. Moreover, if we are free to use an expensive ground truth, we can rely on extensive counterfactual oversight, including bootstrapping, opening up a promising family of solutions to the control problem.

If we have good algorithms for semi-supervised RL, then the cost of the ground truth procedure need not be a bottleneck. The feedback efficiency of our semi-supervised RL algorithm determines just how expensive the ground truth can feasibly be.

Semi-supervised RL as an RL problem

Even setting aside AI control, semi-supervised RL is an interesting challenge problem for reinforcement learning. It provides a different angle on understanding the efficiency of reinforcement learning algorithms, and a different yardstick by which to measure progress towards “human-like” learning.

Methods for semi-supervised RL are also likely to be useful for handling sparsity and variance in reward signals more generally. Even if we are only interested in RL problems with full supervision, these are key difficulties. Isolating them in a simple environment can help us understand possible solutions.

Application to AI control: facilitating present work

In the short term, I think that counterfactual oversight and bootstrapping are worth exploring experimentally. Both involve optimizing an expensive ground truth, and so performing interesting experiments is already bottlenecked on competent semi-supervised RL.

Application to AI control: measuring success

Several AI control problems arise naturally in the context of semi-supervised RL:

  • Detecting context changes. An agent’s estimate of rewards may become inaccurate in a new context — for example, once the agent learns to perform a new kind of action. A successful agent for active semi-supervised RL must learn to recognize possible changes and query for feedback in response to those changes.
  • Handling uncertainty about the reward function. An efficient semi-supervised RL agent must behave well even when it doesn’t have a precise estimate for the reward function. In particular, it will sometimes have a much better and more stable model of the environment dynamics than of the reward function. This problem is especially interesting if we track performance during training rather than just measuring time to convergence.
  • Eliciting information and communicating. In many environments, acquiring information about the unobserved reward may require behaving strategically. For example, an agent might ask questions of a human overseer in order to efficiently learn what the overseer would decide if they performed an extensive evaluation. The agent is motivated to communicate effectively so that the overseer can quickly reach accurate conclusions. This is a key behavior for aligned AI systems.
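As an illustration of the first item above, detecting context changes, one common heuristic is to keep several reward models and request feedback whenever they disagree. This is a swapped-in technique used for illustration, not a method proposed here. The function should_request_label, the threshold, and the toy models are all hypothetical.

```python
from statistics import pstdev
from typing import Callable, Sequence

# A reward model maps an episode's feature vector to a predicted reward.
RewardModel = Callable[[Sequence[float]], float]


def should_request_label(models: Sequence[RewardModel],
                         features: Sequence[float],
                         threshold: float) -> bool:
    """Ask for feedback when the ensemble's predictions disagree.

    Disagreement is measured as the population standard deviation of the
    predicted rewards; large disagreement is treated as a sign that the
    episode lies outside the contexts the models were trained on.
    """
    predictions = [model(features) for model in models]
    return pstdev(predictions) > threshold


# Usage: two toy models that agree near zero and diverge far from it.
models = [lambda f: f[0], lambda f: 2.0 * f[0]]
familiar = should_request_label(models, [0.0], threshold=1.0)   # agree
novel = should_request_label(models, [10.0], threshold=1.0)     # disagree
```

The threshold trades off the two costs named earlier: a low value spends more feedback requests, while a high value risks acting on an inaccurate reward estimate in a new context.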

We can study these problems in a semi-supervised RL setup, where we have a precisely defined objective and can easily measure success. Having a clean framework for measuring performance may help close the gap between problems in AI control and traditional research problems in AI.