1 of 82

Tutorial:

Self-Supervised Reinforcement Learning

Benjamin Eysenbach

Assistant Professor of Computer Science

June 12, 2025


Learning from data (experience) without labels (rewards, optimal actions).

2 of 82

Reinforcement learning is the future of machine learning

Promise of reinforcement learning:

Solving problems that humans today don't know how to solve.

ML today: taught (primarily) via examples.


3 of 82

Why aren't we there yet?


The gap between the capabilities of RL today and the future capabilities of RL:

  • long horizons
  • sparse rewards

4 of 82

Why aren't we there yet?


  • long horizons
  • sparse rewards
  • exploration
  • space of possible behaviors is huge

No rewards yet – you're still missing the peak of the right tower…

5 of 82

Why aren't we there yet?


  • long horizons
  • sparse rewards
  • exploration
  • space of possible behaviors is huge

Image credit: Keenan Crane

6 of 82

Space of possible behaviors is huge.


Image credit: Keenan Crane

7 of 82

Space of possible behaviors is huge.


Policy parameters: θ

8 of 82

A knob to sweep over the space of behaviors.



10 of 82

A knob to sweep over the space of behaviors.


Sharma, Archit, et al. "Dynamics-Aware Unsupervised Discovery of Skills." ICLR, 2020.


12 of 82

A knob to sweep over the space of behaviors.


Image credit: Keenan Crane

13 of 82

A knob to sweep over the space of behaviors.

Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

  • Explore to discover all behaviors.
  • Represent behaviors for fast search.


Learning this knob is a pre-training task, done without human supervision.

14 of 82

Comparison with intrinsic motivation

Intrinsic motivation

  • E.g., curiosity, surprise, novelty.
  • Also used for pre-training.

Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

  • Explore to discover all behaviors.
  • Represent behaviors for fast search.


[Achiam and Sastry, 2017; Barto, 2012; Colas et al., 2019; Jaques et al., 2019; Kulkarni et al., 2016; Mohamed and Jimenez Rezende, 2015; Oudeyer et al., 2007; Schmidhuber, 2010; Singh et al., 2010; Stout et al., 2005]

15 of 82

Outline for today


Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

Skill learning as a game

An example algorithm

Does skill learning work?

Mathematically, what is this doing?

Are skills optimal?

Using skills for rapid adaptation

The frontier of skill learning

16 of 82

Disclaimer: many different perspectives on this topic


  • It's about exploration!
  • It's about representations!
  • It's about skills!
  • It's a behavioral foundation model!

17 of 82

Preliminaries

Observations / states: s

Actions: a

Skill representation: z, either discrete (0, 1, …) or continuous (a vector), sampled from p(z).

Policy: π_θ(a | s, z); a skill is the behavior obtained by conditioning the policy on a particular z.

Note: defining a skill requires both a skill representation (z) and policy parameters (θ).
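To make the notation concrete, here is a minimal Python sketch of these objects. The names (skill_prior, skill_policy) and the linear policy are hypothetical illustrations, not anything from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SKILLS, NUM_ACTIONS, STATE_DIM = 8, 4, 3

def skill_prior() -> int:
    """Sample a discrete skill z ~ p(z); a continuous z could instead be
    drawn from, e.g., a standard Gaussian."""
    return int(rng.integers(NUM_SKILLS))

def skill_policy(theta: np.ndarray, state: np.ndarray, z: int) -> int:
    """A skill-conditioned policy pi_theta(a | s, z): a linear score over
    actions, computed from the state and a one-hot encoding of z."""
    z_onehot = np.eye(NUM_SKILLS)[z]
    features = np.concatenate([state, z_onehot])   # combine s and z
    logits = theta @ features                      # theta: (NUM_ACTIONS, STATE_DIM + NUM_SKILLS)
    return int(np.argmax(logits))                  # greedy action, for illustration

# A "skill" is the pair (z, theta): fixing both pins down one behavior.
theta = rng.normal(size=(NUM_ACTIONS, STATE_DIM + NUM_SKILLS))
z = skill_prior()
action = skill_policy(theta, np.zeros(STATE_DIM), z)
```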

18 of 82

Preliminaries: the discounted state occupancy measure

18

ρ^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t Pr(s_t = s); the skill-conditioned occupancy ρ(s | z) is defined analogously.

[Puterman, 2014; Syed et al., 2008]
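As one concrete reading of this definition, a rough Monte Carlo estimator of ρ(s | z) from rollouts might look like the sketch below (a hypothetical environment interface with hashable states; not the tutorial's code).

```python
from collections import defaultdict

def estimate_occupancy(env, policy, z, gamma=0.99, num_episodes=100, horizon=200):
    """Estimate rho(s | z) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | z) by
    averaging discounted visit counts over rollouts of a single skill z.

    Assumes a simple interface: env.reset() -> state, env.step(a) -> (state, done),
    with hashable (e.g., discrete) states, and policy(state, z) -> action.
    """
    counts = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        for t in range(horizon):
            counts[s] += (1.0 - gamma) * gamma**t   # discounted visit weight
            s, done = env.step(policy(s, z))
            if done:
                break
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}  # normalize into a distribution
```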

19 of 82

Skill learning as a game [Warde-Farley, 2018]

19


20 of 82

What skills emerge from this game?


Exploration: skills should cover the state space.

Predictability: can you guess what a skill will do?

21 of 82

How many bits of information can the skill z communicate to the discriminator (through the states the policy visits)?


22 of 82

A prototypical algorithm for skill learning


1. Sample a skill z ~ p(z).
2. Collect one episode in the environment with this skill.
3. The learned discriminator estimates the skill from the states visited; update the discriminator to maximize its accuracy.
4. Update the skill's policy to maximize the discriminator's accuracy.

[Achiam et al., 2018; BE et al., 2019; Co-Reyes et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Sharma et al., 2019; Warde-Farley et al., 2018; …]
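To make the four steps concrete, here is a minimal sketch of this kind of loop (DIAYN-style). The helper callables rollout and rl_update, the linear discriminator, and the uniform prior are hypothetical stand-ins, not the tutorial's implementation; the point is only the structure: the discriminator learns to guess z from states, and its log-probability becomes the policy's reward.

```python
import numpy as np

def train_skills(rollout, rl_update, num_skills=8, state_dim=3,
                 num_iterations=1000, lr=0.1, seed=0):
    """A DIAYN-style skill-learning loop (sketch only).

    rollout(z)         -> list of state vectors visited while executing skill z
    rl_update(s, z, r) -> one policy-improvement step on reward r (any RL algorithm)
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((num_skills, state_dim))     # linear discriminator: logits = W @ s
    log_p_z = np.log(1.0 / num_skills)        # uniform skill prior p(z)

    for _ in range(num_iterations):
        z = int(rng.integers(num_skills))     # 1. sample a skill z ~ p(z)
        states = rollout(z)                   # 2. collect one episode with this skill
        for s in states:
            logits = W @ s
            q = np.exp(logits - logits.max())
            q /= q.sum()                      # discriminator's guess q(z' | s)

            # 3. update the discriminator to predict z from s (ascend log q(z | s))
            W += lr * (np.eye(num_skills)[z] - q)[:, None] * s[None, :]

            # 4. reward the policy for being recognizable: r = log q(z | s) - log p(z)
            rl_update(s, z, float(np.log(q[z] + 1e-8) - log_p_z))
    return W
```

As the next slide emphasizes, the reward here comes from the learned discriminator, not from the environment.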

23 of 82

Key point: the RL policy generates its own rewards.



24 of 82

Does this work?


[BE et al, 2019]

25 of 82

Does this work?


[Sharma et al, 2019]

26 of 82

Does this work?


[Zheng et al, 2025]

27 of 82

Does this work? (disclaimer: unlabeled demos also used)

27

[Peng et al, 2022]


29 of 82

Outline for today


Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

Skill learning as a game

An example algorithm

Does skill learning work?

Mathematically, what is this doing?

Are skills optimal?

Using skills for rapid adaptation

The frontier of skill learning

30 of 82

Mathematically, we are optimizing mutual information.

Review of mutual information [Shannon, 1948]

  • You send signal x over a noisy wire, and the receiver sees y. How much information does y tell you about x?
    • No noise → all the bits in x.
    • No signal → 0 bits.


31 of 82

Mathematically, we are optimizing mutual information.

Review of mutual information [Shannon, 1948]

  • You send signal x over a noisy wire, and the receiver sees y. How much information does y tell you about x?
    • No noise → all the bits in x.
    • No signal → 0 bits.
  • Formally, I(x; y) = H(x) − H(x | y) = E_{p(x, y)}[ log p(x, y) − log p(x) − log p(y) ].
  • Many applications:
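Written out, the two limiting cases follow directly from the definition:

```latex
I(x; y) = H(x) - H(x \mid y)
\quad\Longrightarrow\quad
\begin{cases}
y = x \;(\text{no noise}): & H(x \mid y) = 0 \;\Rightarrow\; I(x; y) = H(x) \;\text{(all the bits in } x\text{)},\\[2pt]
y \perp x \;(\text{no signal}): & H(x \mid y) = H(x) \;\Rightarrow\; I(x; y) = 0 \;\text{bits}.
\end{cases}
```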


32 of 82

Mathematically, we are optimizing mutual information.

Learning skills by optimizing I(τ; z), where τ is a trajectory and z is the knob for controlling trajectories/behaviors.


34 of 82

Mathematically, we are optimizing mutual information.

What skills emerge from this game?

Learning skills by optimizing I(τ; z) = H(z) − H(z | τ):

  • H(z): a constant, ignore this.
  • −H(z | τ): can you predict which skill generated a trajectory?

Exploration: skills should cover the state space.

Predictability: can you guess what a skill will do?
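These two properties line up with the two standard decompositions of the same mutual information (written here for the state-based objective I(s; z); this is a textbook identity, not specific to any one method):

```latex
I(s; z)
\;=\; \underbrace{H(s)}_{\text{exploration: cover many states}} \;-\; \underbrace{H(s \mid z)}_{\text{predictability: each skill is consistent}}
\;=\; \underbrace{H(z)}_{\text{constant for a fixed prior } p(z)} \;-\; \underbrace{H(z \mid s)}_{\text{discriminability: recover } z \text{ from } s}
```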

35 of 82

Maximizing MI is a cooperative game with two players:

  • the policy / skills
  • the discriminator


36 of 82

Policy optimization

Apply RL to the reward

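One standard construction (DIAYN-style, sketched here as an example; writing q_φ(z | s) for the learned discriminator, the exact reward on this slide may differ) lower-bounds the mutual information with the discriminator and reads off a per-state reward:

```latex
I(s; z) \;=\; H(z) - H(z \mid s)
\;\ge\; H(z) + \mathbb{E}_{z \sim p(z),\; s \sim \rho(\cdot \mid z)}\!\big[\log q_\phi(z \mid s)\big]
\quad\Longrightarrow\quad
r_z(s) \;=\; \log q_\phi(z \mid s) - \log p(z).
```

The policy is thus rewarded for visiting states from which its skill can be recognized.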

37 of 82

Discriminator optimization

Train the discriminator with maximum likelihood:

  • z is discrete → classification.
  • z is continuous → regression.
    • special case: z is a normalized vector [Hansen et al., 2019; Park et al., 2023; Warde-Farley et al., 2018; Zheng et al., 2024; …]
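For continuous z, "regression" is maximum likelihood under a simple density model: for example, with q(z | s) = N(z; Ms, I), the log-likelihood is a negative squared error. A minimal sketch with a hypothetical linear mean predictor:

```python
import numpy as np

def gaussian_discriminator_update(M, states, z, lr=0.1):
    """Maximum-likelihood step for q(z | s) = N(z; M @ s, I) with continuous z.

    Maximizing log q(z | s) = -0.5 * ||z - M @ s||^2 + const is ordinary
    least-squares regression from states onto the skill vector z.
    M      : (skill_dim, state_dim) weights of a linear mean predictor
    states : state vectors visited while executing the skill encoded by z
    """
    for s in states:
        pred = M @ s
        M += lr * (z - pred)[:, None] * s[None, :]   # gradient of the log-likelihood
    return M

# The policy's reward would then be log q(z | s) - log p(z)
# = -0.5 * ||z - M @ s||**2 - log p(z) + const.
```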

38 of 82

Which mutual information should we use?


  • For Markov tasks, only visitation frequency matters.
  • For non-Markov tasks, state visitation order matters. [Achiam et al., 2018, …]
  • Works poorly in practice – the discriminator looks at actions instead of states.
  • Useful for learning skills that encode relative behaviors ("move to the left", "add another block to the tower"). [Gregor et al., 2016; Sharma et al., 2019; Zheng et al., 2024, …]
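For reference, three mutual-information objectives commonly studied in this literature (listed as context; the pairing with the bullets above varies by paper) are:

```latex
I(s; z) \;\;(\text{individual states}), \qquad
I(\tau; z) \;\;(\text{whole trajectories, } \tau = (s_0, a_0, s_1, \ldots)), \qquad
I(s_{t+1}; z \mid s_t) \;\;(\text{transitions}).
```

Conditioning on the current state, as in the last variant (e.g., Sharma et al., 2019), is what tends to yield "relative" skills such as moving left or adding another block.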

39 of 82

Alternatives to mutual information

  • Hierarchical RL [Bacon et al., 2017; Kulkarni et al., 2016; Parr and Russell, 1997; …] – can be effective when rewards are given.
  • L2 distance, Wasserstein distance [He et al., 2022; Park et al., 2023; …] – allows users to specify a metric on the space of states.
  • Meta-RL (learning to learn) [Beck et al., 2023; Hospedales et al., 2021; Gupta et al., 2018; …] – can be effective when many rewards are given.


40 of 82

Skill learning is hierarchical empowerment [Klyubin et al., 2005; Salge et al., 2014, …]


Empowerment (one option): choose actions that exert a high degree of influence over future states.

Skill learning (one option): learn skills that exert a high degree of influence over future states.
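One common formalization of empowerment [Klyubin et al., 2005; Salge et al., 2014] is the channel capacity from a sequence of actions to a future state:

```latex
\mathcal{E}(s_t) \;=\; \max_{p(a_{t:t+k-1})} I\big(a_{t:t+k-1};\, s_{t+k} \mid s_t\big).
```

Replacing the action sequence with a skill z recovers the skill-learning objective (maximize I(z; s) over p(z) and θ): skills play the role of high-level actions whose influence over future states is maximized.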

41 of 82

Skill learning is hierarchical empowerment [Klyubin et al., 2005; Salge et al., 2014, …]



  • Think about skills as high-level actions.
  • Skill learning acquires skills that empower a random high-level policy.

42 of 82

Outline for today


Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

Skill learning as a game

An example algorithm

Does skill learning work?

Mathematically, what is this doing?

Are skills optimal?

Using skills for rapid adaptation

The frontier of skill learning

43 of 82

Using skills to solve downstream tasks


Sharma, Archit, et al. "Dynamics-Aware Unsupervised Discovery of Skills." ICLR, 2020.

Skills as solutions.

[BE et al., 2018; Hansen et al., 2019; He et al., 2022; Park et al., 2023; …]


45 of 82

Using skills to solve downstream tasks


Skills as solutions.

[Warde-Farley et al, 2018]

46 of 82

Using skills to solve downstream tasks


Skills as solutions.

[BE et al., 2018; Hansen et al., 2019; He et al., 2022; Park et al., 2023; …]


48 of 82

Using skills to solve downstream tasks


Skills as solutions.

Sequencing skills.

[Co-Reyes et al., 2018; Florensa et al., 2017; Gregor et al., 2016; Sharma et al., 2019; …]

49 of 82

Using skills to solve downstream tasks


Skills as solutions.

Sequencing skills.

One episode: execute one skill for a while, then switch to a new skill, and so on (see the sketch below).

[Co-Reyes et al., 2018; Florensa et al., 2017; Gregor et al., 2016; Sharma et al., 2019; …]
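A minimal sketch of the sequencing idea (hypothetical names; here the high-level controller simply picks a new skill every k steps, though it could itself be a learned policy):

```python
def sequenced_episode(env, skill_policy, choose_skill, horizon=600, k=100):
    """Roll out one episode by chaining skills: every k steps the high-level
    controller choose_skill(state) picks a new z, and the low-level
    skill-conditioned policy executes it.

    Assumes env.reset() -> state and env.step(a) -> (state, done)."""
    state = env.reset()
    states = [state]
    for t in range(horizon):
        if t % k == 0:
            z = choose_skill(state)                        # high-level: which skill next?
        state, done = env.step(skill_policy(state, z))     # low-level: skill z acts
        states.append(state)
        if done:
            break
    return states
```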

50 of 82

Using skills to solve downstream tasks


Skills as solutions.

Sequencing skills.

[Sharma et al., 2019]

51 of 82

Using skills to solve downstream tasks


Skills as solutions.

Sequencing skills.

Comparison: without skills vs. with skills.

[Lee et al., 2020]

52 of 82

Using skills to solve downstream tasks


Skills as practice.

Skills as solutions.

Sequencing skills.

  • Meta-learning – use skills as tasks to practice learning quickly. [Gupta et al., 2018; Jabri et al., 2019; …]
  • Multi-agent cooperation – use skills as opponents to practice cooperating with. [Chen, 2020; Lee et al., 2020; Szot et al., 2023; Xin et al., 2023; …]

53 of 82

Theoretically, are skills learned by mutual information optimal for solving new tasks?


Expressive enough to represent many behaviors.

Satisfied even by an untrained policy, by defining z to be the parameters of a neural network.

→ Poor organization/compression of behaviors (e.g., highly redundant).

→ Fails to accelerate solving downstream tasks.

→ But it still seems like a nice property to have.

54 of 82

Theoretically, are skills learned by mutual information optimal for solving new tasks?


Enables rapid adaptation to new tasks.

55 of 82

Each skill visits a certain distribution over states.


56 of 82

Each skill visits a certain distribution over states.


This orange region is the state marginal polytope.

57 of 82

Reward maximizing policies lie at vertices.
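Why vertices? The expected discounted return is linear in the discounted state occupancy measure (slide 18), so maximizing any reward over the set of achievable occupancy measures (the state marginal polytope) is a linear program, and a linear objective over a polytope is maximized at a vertex:

```latex
\mathbb{E}_\pi\!\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big]
\;=\; \frac{1}{1-\gamma} \sum_{s} \rho^\pi(s)\, r(s)
\;=\; \frac{1}{1-\gamma}\,\langle \rho^\pi, r \rangle,
\qquad
\max_{\pi}\, \langle \rho^\pi, r \rangle \;=\; \max_{\rho \in \mathcal{K}} \langle \rho, r \rangle .
```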

57


60 of 82

Where do skills lie on the polytope?


61 of 82

Where do skills lie on the polytope?


No skill here

[BE et al., 2021; also, see Alegre et al., 2022]

62 of 82

Where do skills lie on the polytope?

Skills lie at vertices of the state marginal polytope.

Each skill is optimal for some downstream reward function.

MI-based skill learning will recover at most |S| distinct skills.


[BE et al., 2021; also, see Alegre et al., 2022]

63 of 82

A different perspective on skill learning


64 of 82

A different perspective on skill learning


[Figure: a distribution over skills, placing 70% / 30% / 0% / 0% probability on four skills.]


66 of 82

A different perspective on skill learning


[Figure: a prior p(z) placing 70% / 30% weight on two skills.]

Learning p(z) corresponds to learning some "candidate" state distribution that you can use for solving new tasks.

67 of 82

The difficulty of learning a new task depends on how far that task's state distribution is from this prior.


68 of 82

Mutual information minimizes the "distance" to the furthest policy.

[Cover and Thomas, 2006; Gallager, 1979; Ryabko 1979]
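This is the classical redundancy–capacity (minimax) theorem from the references above, transported to skills: writing ρ(· | z) for the state distribution of skill z and K for the state marginal polytope, it reads (roughly) as

```latex
\max_{p(z)} I(s; z) \;=\; \min_{q}\; \max_{\rho \in \mathcal{K}} D_{\mathrm{KL}}\big(\rho \,\|\, q\big),
```

and the optimal q is the state marginal induced by the MI-maximizing prior, q*(s) = Σ_z p*(z) ρ(s | z): the distribution that minimizes the KL "distance" to the furthest achievable policy.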



71 of 82

Mutual information skill learning finds the optimal initialization for an unknown reward function.


Comparison: a uniform prior [Lee et al., 2018; Hazan et al., 2018] vs. the prior from maximizing mutual information, evaluated on a hard task.

[BE et al, 2021]

72 of 82

Outline for today


Goal for today:

How to compress the enormous space of behaviors, so that it is easy to search over?

Skill learning as a game

An example algorithm

Does skill learning work?

Mathematically, what is this doing?

Are skills optimal?

Using skills for rapid adaptation

The frontier of skill learning

73 of 82

Compression perspective of skills


Vision, Language, Audio: (data) → compression → foundation model.

RL: (set of all behaviors) → compression → behavioral foundation model, via joint optimization.

74 of 82

Outlook:

Generalization is the next frontier for self-supervised RL

Papers today measure the number of things that skills do

  1. As argued before, the number of skills is the wrong metric – we need to focus on whether we can solve new tasks quickly.
  2. We don't need to practice every behavior. Thought experiment:
    1. Hold out some z during training.


75 of 82

What's missing?


LLMs and generative image models can generalize in this way – why not RL?

76 of 82

Outlook:

Generalization is the next frontier for self-supervised RL


New York City

  • 100,000,000,000,000,000 (10^17): the number of unique paths across the city.
  • But humans who have seen only a tiny fraction of these paths can navigate effectively.
  • How? By finding patterns in this space of possible behaviors.

77 of 82

Outlook:

Generalization is the next frontier for self-supervised RL


How can we represent an exponential number of behaviors?

78 of 82

Outlook:

Generalization is the next frontier for self-supervised RL


Simulators and digital twins are going to be important.

GpuDrive

… many others!


80 of 82

Thank you!


81 of 82

Thank you!

Reach out to chat this week!

eysenbach@princeton.edu


  • Slides
  • Full list of references
  • Written version of tutorial

Questions?
