Tutorial:
Self-Supervised Reinforcement Learning
Benjamin Eysenbach
Assistant Professor of Computer Science
June 12, 2025
Learning from data (experience) without labels (rewards, optimal actions).
Reinforcement learning is the future of machine learning
ML today: taught (primarily) via examples.
The promise of reinforcement learning: solving problems that humans today don't know how to solve.
Why aren't we there yet?
Capabilities of RL today vs. future capabilities of RL:
long horizons
sparse rewards
exploration (no rewards yet – you're still missing the peak of the right tower…)
the space of possible behaviors is huge
Space of possible behaviors is huge.
Image credit: Keenan Crane
Policy parameters $\theta$: a knob to sweep over the space of behaviors.
Sharma, Archit, et al. "Dynamics-Aware Unsupervised Discovery of Skills." ICLR, 2020.
Image credit: Keenan Crane
Goal for today:
How to compress the enormous space of behaviors, so that it is easy to search over?
Learning this knob is a pretraining task done without human supervision
Comparison with intrinsic motivation
[Achiam and Sastry, 2017; Barto, 2012; Colas et al., 2019; Jaques et al., 2019; Kulkarni et al., 2016; Mohamed and Jimenez Rezende, 2015; Oudeyer et al., 2007; Schmidhuber, 2010; Singh et al., 2010; Stout et al., 2005]
Outline for today
Skill learning as a game
An example algorithm
Does skill learning work?
Mathematically, what is this doing?
Are skills optimal?
Using skills for rapid adaptation
The frontier of skill learning
Disclaimer: there are many different perspectives on this topic:
it's about exploration!
it's about representations!
it's about skills!
it's a behavioral foundation model!
Preliminaries
Observations / states: $s \in \mathcal{S}$
Actions: $a \in \mathcal{A}$
Skill representation: $z$, discrete (0, 1, …) or continuous (a vector), sampled from $p(z)$.
Policy: $\pi_\theta(a \mid s, z)$
Note: defining a skill requires both a skill representation ($z$) and policy parameters ($\theta$).
Preliminaries: the discounted state occupancy measure
$$\rho^{\pi}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p_t^{\pi}(s),$$
where $p_t^{\pi}(s)$ is the probability that policy $\pi$ occupies state $s$ at time $t$; the skill-conditioned measure $\rho^{\pi}(s \mid z)$ is defined analogously.
[Puterman, 2014; Syed et al., 2008]
Skill learning as a game [Warde-Farley et al., 2018]
What skills emerge from this game?
Exploration: skills should cover the state space.
Predictability: can you guess what a skill will do?
How many bits of information can the skill $z$ communicate to the states $s$?
A prototypical algorithm for skill learning
1. Sample a skill $z \sim p(z)$.
2. Collect one episode with this skill.
3. A learned discriminator estimates the skill from the states visited; update the discriminator to maximize accuracy.
4. Update the skill policy to maximize the discriminator's accuracy.
[Achiam et al., 2018; BE et al., 2019; Co-Reyes et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Sharma et al., 2019; Warde-Farley et al., 2018; …]
Key point: the RL policy generates its own rewards.
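To make the loop concrete, here is a minimal Python sketch of one iteration, assuming discrete skills. The `policy`, `discriminator`, and `env` objects and their methods are hypothetical placeholders (not any specific library's API); the reward takes the DIAYN-style form $\log q_\phi(z \mid s) - \log p(z)$.

```python
import numpy as np

def skill_learning_iteration(policy, discriminator, env, p_z):
    # 1. Sample a skill from the (fixed) prior.
    z = np.random.choice(len(p_z), p=p_z)

    # 2. Collect one episode with this skill. The environment's own
    #    reward is ignored: this is self-supervised.
    states, s, done = [], env.reset(), False
    while not done:
        a = policy.act(s, z)
        s, _, done, _ = env.step(a)
        states.append(s)

    # 3. Update the discriminator to predict z from the visited states
    #    (maximum likelihood).
    discriminator.update(states, z)

    # 4. The policy generates its own reward: log q(z|s) - log p(z),
    #    i.e., how recognizable the skill is from the states it visits.
    rewards = [discriminator.log_prob(z, s) - np.log(p_z[z]) for s in states]
    policy.update(states, rewards)
```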
Does this work? Results from several papers:
[BE et al., 2019]
[Sharma et al., 2019]
[Zheng et al., 2025]
[Peng et al., 2022] (disclaimer: unlabeled demos also used)
Outline for today (recap). Next: mathematically, what is this doing?
Mathematically, we are optimizing mutual information.
Review of mutual information [Shannon, 1948]
Learning skills by optimizing $I(\tau; z)$, where $\tau$ is a trajectory and $z$ is the knob for controlling trajectories/behaviors.
What skills emerge from optimizing this objective?
Exploration: skills should cover the state space.
Predictability: can you guess what a skill will do? Equivalently, can you predict which skill generated a trajectory? (The skill entropy $\mathcal{H}[z]$ is a constant; ignore it.)
Maximizing MI is a cooperative game with two players: the policy/skills and the discriminator.
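The two expansions of the same mutual information make these desiderata explicit (a standard identity; the labels under each term are my glosses on the discussion above):

$$I(\tau; z) \;=\; \underbrace{\mathcal{H}[z]}_{\text{constant}} - \underbrace{\mathcal{H}[z \mid \tau]}_{\text{predictability}} \;=\; \underbrace{\mathcal{H}[\tau]}_{\text{exploration}} - \underbrace{\mathcal{H}[\tau \mid z]}_{\text{consistency}}$$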
Policy optimization
Apply RL to the reward $r_z(s) = \log q_\phi(z \mid s) - \log p(z)$, i.e., the discriminator's log-likelihood of the sampled skill (minus a baseline that is constant when $p(z)$ is fixed).
Discriminator optimization
Train $q_\phi(z \mid s)$ with maximum likelihood:
$z$ is discrete → classification.
$z$ is continuous → regression.
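A minimal PyTorch sketch of the discrete-$z$ case, where maximum likelihood reduces to cross-entropy classification (architecture and hyperparameters are illustrative assumptions, not from the talk):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """q_phi(z | s): a classifier over skills, given a state."""
    def __init__(self, state_dim, num_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, states):
        return self.net(states)  # logits over skills

disc = Discriminator(state_dim=4, num_skills=8)
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

def discriminator_update(states, skills):
    # Maximum likelihood = cross-entropy for discrete z.
    # For continuous z, replace this with a regression head and
    # maximize a Gaussian log-likelihood (minimize squared error).
    loss = F.cross_entropy(disc(states), skills)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```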
Which mutual information should we use?
For Markov tasks, only visitation frequency matters (a state-based MI suffices).
For non-Markov tasks, state visitation order matters, so use the whole trajectory. [Achiam et al., 2018, …]
Conditioning the discriminator on actions works poorly in practice – the discriminator looks at actions instead of states.
A conditional MI is useful for learning skills that encode relative behaviors ("move to the left", "add another block to the tower"). [Gregor et al., 2016; Sharma et al., 2019; Zheng et al., 2024, …]
See the candidate objectives sketched below.
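One plausible menu of objectives corresponding to these considerations (pairing each expression with a bullet is my reading of the cited papers, not something preserved from the slide):

$$\underbrace{I(z;\, s)}_{\text{state marginal}} \qquad \underbrace{I(z;\, \tau)}_{\text{whole trajectory}} \qquad \underbrace{I(z;\, s_{t+1} \mid s_t)}_{\text{relative behaviors}}$$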
Alternatives to mutual information
Skill learning is hierarchical empowerment. [Klyubin et al., 2005; Salge et al., 2014, …]
Empowerment (one option): choose actions that exert a high degree of influence over future states.
Skill learning (one option): learn skills that exert a high degree of influence over future states.
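In symbols, one common way to write the parallel (my formalization; the exact variants shown on the slide were not preserved):

$$\text{empowerment: } \max_{p(a_{1:k})} I\big(s_{t+k};\, a_{1:k} \,\big|\, s_t\big) \qquad\qquad \text{skill learning: } \max_{\pi,\, p(z)} I(s;\, z)$$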
Outline for today (recap). Next: using skills for rapid adaptation; then, are skills optimal?
Using skills to solve downstream tasks
Skills as solutions: one of the learned skills may already solve the new task, so search over $z$ for the best one.
Sharma, Archit, et al. "Dynamics-Aware Unsupervised Discovery of Skills." ICLR, 2020.
[BE et al., 2018; Hansen et al., 2019; He et al., 2022; Park et al., 2023; Warde-Farley et al., 2018; …]
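A minimal sketch of "skills as solutions": evaluate each frozen skill on the downstream reward and keep the best one, with zero gradient updates. The `policy`, `env`, and `reward_fn` objects are hypothetical placeholders.

```python
import numpy as np

def best_skill(policy, env, reward_fn, num_skills, episodes_per_skill=5):
    # Roll out each frozen skill and score it with the downstream reward.
    avg_return = np.zeros(num_skills)
    for z in range(num_skills):
        for _ in range(episodes_per_skill):
            s, done = env.reset(), False
            while not done:
                a = policy.act(s, z)        # pretrained policy, no updates
                s, _, done, _ = env.step(a)
                avg_return[z] += reward_fn(s)
    avg_return /= episodes_per_skill
    return int(np.argmax(avg_return))       # the skill that best solves the task
```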
Sequencing skills: within one episode, a high-level controller picks a new skill $z$ every few steps (see the sketch below).
[Co-Reyes et al., 2018; Florensa et al., 2017; Gregor et al., 2016; Sharma et al., 2019; …]
(Figure: hierarchical control without skills vs. with skills.) [Lee et al., 2020]
Skills as practice.
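And a sketch of sequencing: a high-level controller re-selects the skill every $k$ steps, while the frozen low-level policy executes it (again, `high_level`, `policy`, and `env` are hypothetical placeholders):

```python
def episode_with_skill_sequencing(env, policy, high_level, k=50, max_steps=1000):
    s, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        z = high_level.select_skill(s)   # high-level action = choice of skill
        for _ in range(k):               # low level executes the skill for k steps
            a = policy.act(s, z)
            s, _, done, _ = env.step(a)
            t += 1
            if done or t >= max_steps:
                break
```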
Theoretically, are skills learned by mutual information optimal for solving new tasks?
Property 1: expressive enough to represent many behaviors.
This is satisfied even by an untrained policy, by defining $z$ to be the parameters of a neural network:
→ poor organization/compression of behaviors (e.g., highly redundant);
→ fails to accelerate solving downstream tasks;
→ but expressiveness still seems like a nice property to have.
Property 2: enables rapid adaptation to new tasks. ✓
Each skill visits a certain distribution over states.
The set of achievable state distributions is the state marginal polytope (the orange region in the figure).
Reward-maximizing policies lie at vertices.
Where do skills lie on the polytope? Skills lie at vertices of the state marginal polytope; no skill lies in the interior.
⇒ Skills are all optimal for some downstream reward functions.
MI-based skill learning will recover at most |S| distinct skills.
[BE et al., 2021; also see Alegre et al., 2022]
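Why vertices? Expected reward is linear in the occupancy measure, and a linear function over a polytope is maximized at a vertex (a standard LP fact; notation follows the occupancy measure defined earlier):

$$\max_{\pi}\; \mathbb{E}_{s \sim \rho^{\pi}}\big[r(s)\big] \;=\; \max_{\rho \,\in\, \mathcal{K}} \;\langle \rho,\, r\rangle, \qquad \mathcal{K} = \{\rho^{\pi} : \pi\}.$$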
A different perspective on skill learning
(Figure: the prior over skills induces an average state distribution, e.g., 70% / 30% / 0% / 0% across regions of the state space.)
Learning $p(z)$ corresponds to learning some "candidate" state distribution that you can use for solving new tasks.
The difficulty of learning a new task depends on how far that task's state distribution is from this prior.
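In symbols (my paraphrase of the statement above, using the occupancy-measure notation from earlier): the prior induces the average state distribution, and a new task is hard in proportion to its distance from it,

$$\rho(s) = \sum_{z} p(z)\, \rho^{\pi}(s \mid z), \qquad \text{difficulty of a task with target } \rho^{\star} \;\sim\; D_{\mathrm{KL}}\big(\rho^{\star} \,\|\, \rho\big).$$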
Mutual information minimizes the "distance" to the furthest policy.
[Cover and Thomas, 2006; Gallager, 1979; Ryabko, 1979]
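This is the classical redundancy-capacity duality from information theory (my rendering of the result the citations point to; notation as above):

$$\max_{p(z)}\; I(s;\, z) \;=\; \min_{\rho}\; \max_{z}\; D_{\mathrm{KL}}\big(\rho^{\pi}(\cdot \mid z)\,\|\,\rho\big),$$

so the optimal prior's average state distribution is the "center" of the set of skills: it minimizes the distance to the furthest one.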
Mutual information skill learning finds the optimal initialization for an unknown reward function.
(Figure: uniform prior [Lee et al., 2018; Hazan et al., 2018] vs. prior from maximizing mutual information, evaluated on a hard task.)
[BE et al., 2021]
Outline for today (recap). Finally: the frontier of skill learning.
Compression perspective of skills
Vision, language, audio: data → compression → foundation model.
RL: the set of all behaviors → compression → behavioral foundation model. Here, collecting the behaviors and compressing them are a joint optimization.
Outlook: generalization is the next frontier for self-supervised RL
Papers today measure the number of things that skills can do. What's missing?
LLMs and generative image models can generalize in this way – why not RL? (Figure example: New York City.)
How can we represent an exponential number of behaviors?
Simulators and digital twins are going to be important. (GpuDrive, … many others!)
Thank you! Reach out to chat this week!
Questions?