1 of 22

Accelerating Reinforcement Learning with Learned Skill Priors

Presented by Soroush Nasiriany

11/12/20

2 of 22

Motivation

Learning tasks from scratch with low-level actions is not scalable

Can we extract useful skills from past experiences and use them to solve new tasks?

When solving tasks with these skills, we are restricting the exploration space of actions

However, even this space of skills is still too large and can make exploration challenging...

Can we be intelligent about which skills to use when learning a task?

3 of 22

Example

4 of 22

Skill-Prior RL (SPiRL)

  1. Given a set of offline demonstrations, learn a latent space of skills
  2. Additionally, learn a skill prior on the latent space for any given state; this skill prior tells us which skills are more likely in any given state
  3. Use RL to learn a high-level policy that outputs skill embeddings; incentivize the algorithm to choose skills suggested by the skill prior
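
A compact, hypothetical sketch of how these three phases fit together as a pipeline (all names below are placeholders, not the authors' code):

# Hypothetical outline of the SPiRL pipeline; each function is a
# placeholder for one of the three phases above.

def learn_skill_vae(dataset, H=10):
    """Phase 1: fit a VAE over H-step action sequences from the
    offline data; returns encoder q(z | a_i) and decoder p(a_i | z)."""
    ...

def learn_skill_prior(dataset, encoder):
    """Phase 2: fit p_a(z | s) to match the encoder's output on the
    first state of each action sequence."""
    ...

def train_high_level_policy(env, decoder, prior):
    """Phase 3: SAC-style RL over skill embeddings z, with a KL
    penalty toward the prior in place of the usual entropy bonus."""
    ...

# encoder, decoder = learn_skill_vae(offline_dataset)
# prior = learn_skill_prior(offline_dataset, encoder)
# policy = train_high_level_policy(env, decoder, prior)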

5 of 22

Latent Space of Skills

Given an offline dataset of trajectories: D = {(s_0, a_0, s_1, a_1, ...)}

Consider random action sequences of length H drawn from these trajectories: a_i = (a_t, ..., a_{t+H-1})

Learn a latent variable model that captures this set of sequences, using a Variational Autoencoder (VAE):

p(a_i): the distribution over action sequences we’re trying to capture with the VAE

q(z | a_i): encoder; encodes an action sequence into a skill embedding z

p(a_i | z): decoder; maps a skill embedding back to an action sequence

p(z): prior distribution over the latent space; fixed to a unit Gaussian N(0, I)
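
For reference, these pieces combine into the standard beta-VAE evidence lower bound (a reconstruction of the objective, assuming the usual form):

\log p(a_i) \;\ge\; \mathbb{E}_{q(z \mid a_i)}\!\left[\log p(a_i \mid z)\right] \;-\; \beta \, D_{KL}\!\left(q(z \mid a_i) \,\|\, p(z)\right)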

6 of 22

Latent Space of Skills

7 of 22

Skill Prior

Now we have a latent space of skills. Learning to solve new tasks using this latent skill space is still challenging, as this space can be very large…

Idea: learn a skill prior that suggests which skills to execute for any given state

8 of 22

Skill Prior

For each action sequence a_i, take the first state s in that sequence. Learn a skill prior p_a(z | s) that “matches” the encoded skill distribution q(z | a_i)

Add the following loss term to the VAE objective: D_KL( q(z | a_i) || p_a(z | s) )

We use D(q, p_a) rather than D(p_a, q) so that p_a is mode-covering rather than mode-seeking
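
Putting both terms together, the combined training objective looks roughly like the following (my reconstruction; the paper also stops gradients through q when fitting the prior, so the prior term does not distort the encoder):

\max_{q,\,p,\,p_a} \;\; \mathbb{E}_{q(z \mid a_i)}\!\left[\log p(a_i \mid z)\right] \;-\; \beta \, D_{KL}\!\left(q(z \mid a_i) \,\|\, p(z)\right) \;-\; D_{KL}\!\left(q(z \mid a_i) \,\|\, p_a(z \mid s)\right)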

9 of 22

Full Model

10 of 22

RL with Skills

We now have a latent space of skills and a skill prior

Use RL to learn a task with a policy over latent skills: π(z | s)

When executing a skill z, decode it into an H-step action sequence and execute those actions

Transitions stored in the replay buffer are tuples (s, z, s’), where s and s’ are the states at the beginning and end of the H-step action sequence

The reward now becomes the sum of rewards over the H-step rollout: r̃ = r_t + r_{t+1} + ... + r_{t+H-1}
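
A minimal, runnable sketch of this skill-execution loop; DummyEnv and decode below are hypothetical stand-ins for the real task environment and the trained decoder:

import random

H = 10  # skill horizon

class DummyEnv:
    """Hypothetical stand-in environment with a 1-D state."""
    def __init__(self):
        self.state = 0.0
    def step(self, action):
        self.state += action
        reward = -abs(self.state)  # arbitrary toy reward
        return self.state, reward

def decode(z, H):
    """Stand-in for the learned decoder p(a_i | z): maps a skill
    embedding to an H-step low-level action sequence."""
    return [0.1 * z] * H

env = DummyEnv()
replay_buffer = []

s = env.state               # state at the start of the skill
z = random.gauss(0.0, 1.0)  # skill sampled from the high-level policy
actions = decode(z, H)      # decode the skill into H low-level actions

r_tilde = 0.0               # summed reward over the H-step rollout
for a in actions:
    s_next, r = env.step(a)
    r_tilde += r

# store the high-level transition (s, z, summed reward, state after H steps)
replay_buffer.append((s, z, r_tilde, s_next))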

11 of 22

RL with Skills

Add an additional term to the reward objective, encouraging the policy to pick latent actions suggested by the skill prior

Interpretation: the original max-entropy objective implicitly minimizes the KL divergence to a uniform policy; here we instead minimize the KL divergence to the learned skill prior
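
In symbols, assuming the standard SAC formulation: the entropy bonus equals a negative KL to the uniform policy up to a constant, and SPiRL replaces that reference distribution with the learned prior:

\mathcal{H}\!\left(\pi(z \mid s_t)\right) \;=\; -D_{KL}\!\left(\pi(z \mid s_t) \,\|\, U(z)\right) + \text{const}

J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_t \tilde{r}(s_t, z_t) \;-\; \alpha \, D_{KL}\!\left(\pi(z \mid s_t) \,\|\, p_a(z \mid s_t)\right)\right]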

12 of 22

RL with Skills

13 of 22

Experiments

Can we extract skills from unstructured offline datasets and use them to effectively solve tasks in new domains?

Is the learned skill prior helpful for learning new tasks?

14 of 22

Domains

15 of 22

Baselines

SAC: vanilla RL, not using any offline datasets or learned skills

BC + SAC: apply behavior cloning first on the offline data, then fine-tune with SAC

Note: I think this comparison is unfair; they should try BC + some on-policy RL algorithm

Flat Prior: skills are learned over single-step actions rather than H-step action sequences; effectively “BC-guided SAC” over the low-level actions

Skill Space Policy (SSP) w/o prior: RL over skill latent space, without skill prior

SPiRL: the proposed method; i.e., learn a skill space and a skill prior, and learn a policy in skill space with guidance from the skill prior

16 of 22

Results

17 of 22

Qualitative Analysis

18 of 22

Ablations

19 of 22

Takeaways

Offline datasets are helpful in general for solving complex tasks; SAC could not solve these tasks by itself

SPiRL can transfer knowledge from simpler tasks to more complex tasks

Temporal abstraction in skills is helpful (see the “flat prior” baseline)

Skill prior is essential to reducing burden of exploration (see the “SSP w/o prior” baseline)

Learned skills should not be too short or too long

20 of 22

Limitations

Can we extract semantically meaningful, variable-length action sequences as skills, rather than fixed-length ones?

Will the skill prior transfer to new tasks with different observation spaces?

Can we learn a hierarchical space of skills, or is one homogeneous space of skills enough?

21 of 22

Related Work

Discovering skills through exploration

Hausman et al. Learning an Embedding Space for Transferable Robot Skills. ICLR, 2018

Eysenbach et al. Diversity Is All You Need. ICLR, 2019

Sharma et al. Dynamics-Aware Unsupervised Discovery of Skills. ICLR, 2020

Extracting skills from offline data

Lynch et al. Learning Latent Plans from Play. CoRL, 2019

Ajay et al. OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning. arXiv, 2020

Imitation learning and fine-tuning with RL

Nair et al. Overcoming Exploration in Reinforcement Learning with Demonstrations. ICRA, 2018

Rajeswaran et al. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. RSS, 2018

Gupta et al. Relay Policy Learning. CoRL, 2019

22 of 22

Discussion

Should we discover skills from exploration, or from unstructured offline data?

What criteria constitute “good” skills that will be useful for many tasks?

What ingredients are needed to discover these skills?