An Introduction to Semi-supervised Reinforcement Learning

A great overview of semi-supervised reinforcement learning, including general discussion and implementation information.



The obvious way to study semi-supervised RL is to try and do it:

  1. Start with any classic RL environment. For example, OpenAI recently published an awesome library of environments (OpenAI Gym). The Atari, MuJoCo, and Classic Control environments are especially appropriate.
  2. Rather than providing the rewards to the learner in every episode, provide them only when the learner makes a request (for example, have a feedback call return the sequence of rewards in the most recent episode).
  3. Measure performance as a function of both the number of episodes and the number of labels. For example, measure performance as a function of (N + 1000F), where N is the number of episodes and F is the number of episodes on which feedback is requested.

These modifications are easy to make to the standard RL setup with just a few lines of code.
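
As a rough illustration of the three steps above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `ToyEnv` is a hypothetical stand-in for a real environment, and the names `SemiSupervisedEnv`, `feedback`, and `cost` are invented here rather than taken from any library. The wrapper withholds rewards during play, returns the most recent episode's rewards only on request, and tracks N and F so that N + 1000F can be reported.

```python
class ToyEnv:
    """Hypothetical stand-in for a classic RL environment (not a library API)."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # dummy observation: the current time step

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done


class SemiSupervisedEnv:
    """Hides rewards from the learner unless feedback is requested."""

    def __init__(self, env):
        self.env = env
        self.episodes = 0            # N: completed episodes
        self.labels = 0              # F: episodes on which feedback was requested
        self._rewards = []           # rewards of the episode in progress
        self._last_rewards = None    # rewards of the most recent finished episode
        self._last_labelled = False  # has that episode already been counted in F?

    def reset(self):
        self._rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self._rewards.append(reward)
        if done:
            self.episodes += 1
            self._last_rewards = self._rewards
            self._last_labelled = False
        return obs, None, done  # the reward is withheld

    def feedback(self):
        """Return the rewards of the most recent finished episode."""
        if self._last_rewards is None:
            raise RuntimeError("no finished episode to label yet")
        if not self._last_labelled:  # count each labelled episode once
            self.labels += 1
            self._last_labelled = True
        return list(self._last_rewards)

    def cost(self, label_weight=1000):
        """The N + 1000F measure from step 3."""
        return self.episodes + label_weight * self.labels


if __name__ == "__main__":
    env = SemiSupervisedEnv(ToyEnv())
    for episode in range(1, 11):
        env.reset()
        done = False
        while not done:
            _, reward, done = env.step(1)  # reward is always None here
        if episode % 5 == 0:               # request labels on every 5th episode
            rewards = env.feedback()
    print(env.episodes, env.labels, env.cost())  # 10 2 2010
```

A learner would then be compared by the reward it achieves at a given value of `cost()`, rather than at a given number of episodes.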



It’s easy to come up with some plausible approaches to semi-supervised RL. For example:

  • Train a model to estimate the total reward of an episode, and use it to estimate the payoff of unlabelled episodes or to reduce variance of the normalized feedback estimator.
  • Combine the above with traditional semi-supervised learning, to more quickly learn the reward estimator.
  • Use the observations of the transition function in unlabelled episodes to make more accurate Bellman updates.
  • Learn an estimator for the per-step reward.
  • Use model-based RL and train the model with data from unlabelled episodes.

I expect the first strategy to be the simplest thing to get working. It will of course work especially well in environments where the reward is easy to estimate from observations. For example, in Atari games we could learn to read the score directly from the screen (which is also closer to how a human would learn to play the game).
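
To make the first strategy concrete, here is a small sketch under stated assumptions. It assumes each episode yields a single observable feature (a hypothetical on-screen score reading) and fits a one-feature linear model on the labelled episodes. It then estimates the average payoff over all episodes as the model's mean prediction plus the mean residual on the labelled ones. This "difference estimator" is just one standard way to use a reward model for variance reduction, not necessarily the one intended above, and the function names are invented for illustration.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ~ a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # constant feature: predict the mean return
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def estimate_payoff(features, labels, model):
    """Estimate the mean episode return over all episodes.

    features: one feature value per episode (labelled or not)
    labels:   dict mapping episode index -> true total reward
    model:    (a, b) as returned by fit_linear
    """
    a, b = model
    preds = [a * x + b for x in features]
    model_mean = sum(preds) / len(preds)
    # Correct the model's average using its residuals on labelled episodes.
    correction = sum(labels[i] - preds[i] for i in labels) / len(labels)
    return model_mean + correction


if __name__ == "__main__":
    # Hypothetical data: ten episodes, but only three of them were labelled.
    screen_score = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    labels = {0: 1.0, 3: 7.0, 9: 19.0}

    model = fit_linear([screen_score[i] for i in labels], list(labels.values()))
    labelled_only = sum(labels.values()) / len(labels)
    print(labelled_only)                                 # uses 3 episodes
    print(estimate_payoff(screen_score, labels, model))  # uses all 10 episodes
```

If the labelled episodes are a random subset and the model is fit on separate data, the correction term keeps the estimate unbiased even when the model is poor. The better the model, the more the unlabelled episodes help.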

More interesting experiments

Even very simple examples are interesting from the perspective of AI control, but more complex environments would be more interesting:

  • Training an agent when the most effective reward estimator is manipulable. For example, consider a game where moving your character next to the score display appears to increase the score but has no effect on the actual score. We would like to train an agent not to bother modifying the displayed score.
  • Training an agent to use a reward estimator which requires effort to observe. For example, consider a game that only displays the score when the game is paused.
  • Using human feedback to define a reward function that has good but imperfect estimators. For example, we could teach an agent to play Pac-Man but to eat the power pellets as late as possible or to spend as much time as possible near the red ghost.
  • Providing a sequence of increasingly accurate (and increasingly expensive) reward estimators. I think this is the most natural approach to generalizing from sloppy evaluations to detailed evaluations.
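
The first of these experiments can be prototyped with a very small toy environment. The sketch below is purely illustrative: `ManipulableScoreEnv` and its layout are invented here, not taken from an existing game or library. The agent moves along a short corridor. Standing at the right end earns real points, while standing at the left end (next to the score display) inflates the displayed score without changing the true score, which is only available through `feedback()`.

```python
class ManipulableScoreEnv:
    """Hypothetical corridor game whose displayed score can be manipulated."""

    def __init__(self, size=5, horizon=8, display_bonus=10):
        self.size = size
        self.horizon = horizon
        self.display_bonus = display_bonus
        self.reset()

    def reset(self):
        self.pos = self.size // 2  # start in the middle of the corridor
        self.t = 0
        self.true_score = 0
        return self._observe()

    def _observe(self):
        # Next to the display (position 0) the shown score is inflated.
        bonus = self.display_bonus if self.pos == 0 else 0
        return {"pos": self.pos, "displayed_score": self.true_score + bonus}

    def step(self, action):
        """action is -1 (move left) or +1 (move right)."""
        self.pos = max(0, min(self.size - 1, self.pos + action))
        self.t += 1
        if self.pos == self.size - 1:  # real points are earned at the right end
            self.true_score += 1
        done = self.t >= self.horizon
        return self._observe(), done

    def feedback(self):
        """Expensive label: the true score of the current episode."""
        return self.true_score


def run_episode(env, policy):
    """Play one episode; return (final displayed score, true score)."""
    obs = env.reset()
    done = False
    while not done:
        obs, done = env.step(policy(obs))
    return obs["displayed_score"], env.feedback()


if __name__ == "__main__":
    env = ManipulableScoreEnv()
    print(run_episode(env, lambda obs: -1))  # (10, 0): display looks good, score is not
    print(run_episode(env, lambda obs: +1))  # (7, 7): display matches the true score
```

An agent that trusts the displayed score prefers the first policy. The experiment is whether a small number of `feedback()` calls is enough to teach it to prefer the second.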


I think that semi-supervised RL is an unusually tractable and interesting problem in AI control, and is also a natural problem in reinforcement learning. There are simple experiments to do now and a range of promising approaches to try. Even simple experiments are interesting from a control perspective, and there are natural directions to scale them up to more compelling demonstrations and feasibility tests.

(This research was supported as part of the Future of Life Institute FLI-RFP-AI1 program, grant #2015–143898.)

Bio: Paul Christiano is a Ph.D. candidate in theoretical computer science at UC Berkeley. His research focuses on learning theory and algorithms, and has received top awards at the Symposium on the Theory of Computing. He is interested in technical questions bearing on the social impact of artificial intelligence.

Original. Reposted with permission.