Accelerating Reinforcement Learning with Learned Skill Priors
Presented by Soroush Nasiriany
11/12/20
Motivation
Learning tasks from scratch with low-level actions is not scalable
Can we extract useful skills from past experiences and use them to solve new tasks?
When solving tasks with these skills, we are restricting the exploration space of actions
However, even this space of skills is still too large and can make exploration challenging...
Can we be intelligent about which skills to use when learning a task?
Example
Skill-Prior RL (SPiRL)
Latent Space of Skills
Given an offline dataset of trajectories:
Consider random action sequences of length H:
Learn a latent variable model that captures this set of sequences with a Variational Autoencoder (VAE); a minimal code sketch follows these bullets:
p(ai): the distribution over action sequences we're trying to capture with the VAE
q(z | ai): encoder. Encodes an action sequence into a skill embedding z
p(ai | z): decoder. Maps a skill embedding back to an action sequence
p(z): prior distribution over the latent space. Fixed as a unit Gaussian
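Below is a minimal PyTorch sketch of such a skill VAE, assuming H-step action sequences of a fixed action dimension and simple MLP encoder/decoder; the module names and hyperparameters are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

H, action_dim, z_dim, beta = 10, 4, 10, 5e-4  # illustrative hyperparameters

class SkillVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # q(z | ai): encode a flattened H-step action sequence into a Gaussian over z
        self.encoder = nn.Sequential(nn.Linear(H * action_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * z_dim))
        # p(ai | z): decode a skill embedding back into an H-step action sequence
        self.decoder = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                     nn.Linear(128, H * action_dim))

    def forward(self, actions):                       # actions: (B, H, action_dim)
        mu, log_std = self.encoder(actions.flatten(1)).chunk(2, dim=-1)
        q_z = torch.distributions.Normal(mu, log_std.exp())
        z = q_z.rsample()                             # reparameterized sample
        recon = self.decoder(z).view_as(actions)
        return recon, q_z

def vae_loss(model, actions):
    recon, q_z = model(actions)
    # p(z): fixed unit-Gaussian prior over the latent skill space
    p_z = torch.distributions.Normal(torch.zeros_like(q_z.mean),
                                     torch.ones_like(q_z.stddev))
    rec = ((recon - actions) ** 2).mean()             # reconstruct the action sequence
    kl = torch.distributions.kl_divergence(q_z, p_z).sum(-1).mean()
    return rec + beta * kl
```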
Latent Space of Skills
Skill Prior
Now we have a latent space of skills. Learning to solve new tasks using this latent skill space is still challenging, as this space can be very large…
Idea: learn a skill prior that suggests which skills to execute for any given state
Skill Prior
For each action sequence ai, identify the first state s of that sequence. Learn a skill prior pa(z | s) that “matches” the encoded skill distribution q(z | ai)
Add the corresponding loss term to the VAE objective: a divergence D(q(z | ai), pa(z | s)) (sketch below)
We use D(q, pa) rather than D(pa, q) so that pa is mode-covering rather than mode-seeking
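A minimal sketch of that prior loss, reusing q_z from the VAE sketch above and assuming a hypothetical state-conditioned network prior_net(s) that outputs Gaussian parameters; the encoder distribution is detached so only the prior is pulled toward it.

```python
def skill_prior_loss(prior_net, q_z, first_state):
    # pa(z | s): Gaussian over skills predicted from the first state of the sequence
    mu, log_std = prior_net(first_state).chunk(2, dim=-1)
    p_a = torch.distributions.Normal(mu, log_std.exp())
    # Detach q(z | ai) so gradients only train the prior; the forward direction
    # D(q, pa) makes pa mode-covering w.r.t. the encoded skill distribution.
    q_detached = torch.distributions.Normal(q_z.mean.detach(), q_z.stddev.detach())
    return torch.distributions.kl_divergence(q_detached, p_a).sum(-1).mean()
```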
Full Model
RL with Skills
We now have a latent space of skills and a skill prior
Use RL to learn the downstream task with a high-level policy over latent skills, π(z | s)
When executing a skill z, decode it into an H-step action sequence and execute those low-level actions
Transitions stored in the replay buffer are tuples (s, z, s'), where s and s' are the states at the beginning and end of the H-step action sequence
The reward for such a transition becomes the sum of rewards over the H-step rollout (see the rollout sketch below)
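A minimal sketch of one high-level environment step under these assumptions: a gym-style env, the frozen VAE decoder from above, and a simple replay buffer with a hypothetical add method.

```python
def execute_skill(env, decoder, replay_buffer, s, z, H):
    actions = decoder(torch.as_tensor(z)).view(H, -1)   # decode skill z into H low-level actions
    total_reward, done, s_next = 0.0, False, s
    for a in actions:
        s_next, r, done, _ = env.step(a.detach().numpy())
        total_reward += r                                # reward = sum over the H-step rollout
        if done:
            break
    # a single high-level transition covers the whole skill execution
    replay_buffer.add(s, z, total_reward, s_next, done)
    return s_next, total_reward, done
```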
RL with Skills
Add an additional term to the reward objective that encourages the policy to pick latent skills suggested by the skill prior
Interpretation: the entropy bonus in the original SAC objective minimizes the KL divergence to a uniform policy; SPiRL replaces that uniform policy with the learned skill prior (objective written out below)
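As a reconstruction of the objective this slide describes (with α the temperature weighting the prior term), the high-level policy maximizes the H-step rewards while staying close to the skill prior:

```latex
J(\theta) \;=\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \tilde{r}(s_t, z_t)
  \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(z_t \mid s_t)\,\|\,p_a(z_t \mid s_t)\big)\Big]
```

Setting pa to a uniform distribution recovers the standard maximum-entropy SAC objective, up to a constant.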
RL with Skills
Experiments
Can we extract skills from unstructured offline datasets and use them to effectively solve tasks in new domains?
Is the learned skill prior helpful for learning new tasks?
Domains
Baselines
SAC: vanilla RL, not using any offline datasets or learned skills
BC + SAC: apply behavior cloning on the offline data first, then fine-tune with SAC
Note: I think this comparison is unfair; they should try BC + an on-policy RL algorithm instead
Flat Prior: skills are learned over single-step actions rather than H-step action sequences; effectively “BC-guided SAC” over the low-level actions
Skill Space Policy (SSP) w/o prior: RL over skill latent space, without skill prior
SPiRL: the proposed method, i.e., learn a skill space and a skill prior, and learn a policy in skill space with guidance from the skill prior
Results
Qualitative Analysis
Ablations
Takeaways
Offline datasets are helpful in general for solving complex tasks; SAC could not solve these tasks by itself
SPiRL can transfer knowledge from simpler tasks to more complex tasks
Temporal abstraction in skills is helpful (see the “flat prior” baseline)
Skill prior is essential to reducing burden of exploration (see the “SSP w/o prior” baseline)
Learned skills should not be too short or too long
Limitations
Skills are fixed-length action chunks; can we instead extract semantically meaningful, variable-length action sequences as skills?
Will the skill prior transfer to new tasks with different observation spaces?
Learn a hierarchical space of skills: is one homogeneous space of skills enough?
Related Work
Discovering skills through exploration
Hausman et al. Learning an Embedding Space for Transferable Robot Skills. ICLR, 2018
Eysenbach et al. Diversity Is All You Need: Learning Skills Without a Reward Function. ICLR, 2019
Sharma et al. Dynamics-Aware Unsupervised Discovery of Skills. ICLR, 2020
Extracting skills from offline data
Lynch et al. Learning Latent Plans from Play. CoRL, 2019
Ajay et al. OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning. 2020
Imitation learning and fine-tuning with RL
Nair et al. Overcoming Exploration in Reinforcement Learning with Demonstrations. ICRA, 2018
Rajeswaran et al. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. RSS, 2018
Gupta et al. Relay Policy Learning. CoRL, 2019
Discussion
Should we discover skills from exploration, or from unstructured offline data?
What criteria constitute “good” skills that will be useful for many tasks?
What ingredients are needed to discover these skills?