1 of 15

Offline Reinforcement Learning for Robotics

Soroush Nasiriany

Nov 3, 2021

UT Robot Learning Reading Group


2 of 15

Robot learning from offline data

  • Enabling robots to leverage existing datasets to learn new tasks more efficiently and effectively
  • Types of prior data:
    • Expert human demonstrations
    • Random / scripted policy data
    • Task-agnostic, aka “play” data
    • Data from different (but related) tasks
    • “In the wild” data, e.g., YouTube videos


3 of 15

Why learn from offline data

  • Data-driven robot learning has emerged as a promising approach
  • But…data collection is a major bottleneck in robotics
    • IL (imitation learning): collecting demonstrations is tedious
    • RL: exploration burden
    • Real robots: hardware failures and safety
  • What about simulators?
    • Still need to provide demos or design reward functions
    • Content gap, compute limits, sim2real gap
    • Even without these issues, learning from scratch is naïve…
  • Make efficient use of prior data, don’t throw away anything!


4 of 15

This tutorial: Offline RL

  • Reinforcement Learning without online interaction
  • Instead, the agent has access to an offline dataset of transition tuples collected by a behavior policy
  • Objective: learn a policy that maximizes the expected sum of rewards, using only the offline dataset (formalized below)
  • Primary application: make use of highly suboptimal data (e.g., play data, noisy demonstrations, expert demos from other tasks)
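For concreteness, a minimal formal sketch of this objective (standard notation from the offline RL literature, assuming a dataset \mathcal{D} collected by a behavior policy \pi_\beta; none of these symbols appear on the slide):

    % Optimize the usual RL objective, but learn only from the fixed dataset
    % \mathcal{D} = \{(s_i, a_i, r_i, s'_i)\} collected by the behavior policy \pi_\beta,
    % with no further environment interaction.
    \max_{\pi} \; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]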


5 of 15

A slight detour: Online RL

  • Online, aka “standard” RL: agent alternates between data collection and model updates
  • Off-policy actor-critic methods are popular today (SAC, TD3, etc.)
    • actor-critic: approximate policy iteration (the critic evaluates the current policy, the actor improves it)
    • off-policy: transitions do not need to come from the current policy


Off-policy: replay buffer updated with new data over time
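To make the loop concrete, here is a minimal sketch of a generic off-policy actor-critic training loop (pseudocode-level Python; env, actor, critic and their methods are hypothetical duck-typed interfaces, not any particular library's API):

    import random

    def train_off_policy_actor_critic(env, actor, critic, num_iterations,
                                      steps_per_iter, grad_steps, batch_size):
        """Alternate between collecting data with the current policy and
        updating the actor/critic on minibatches from a replay buffer."""
        replay_buffer = []
        state = env.reset()

        for _ in range(num_iterations):
            # 1. Data collection: roll out the current policy, store transitions.
            for _ in range(steps_per_iter):
                action = actor.sample(state)
                next_state, reward, done = env.step(action)
                replay_buffer.append((state, action, reward, next_state, done))
                state = env.reset() if done else next_state

            # 2. Model updates on minibatches that may come from older policies;
            #    reusing stale transitions is what makes the method off-policy.
            for _ in range(grad_steps):
                batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
                critic.update(batch)          # policy evaluation: fit Q to Bellman targets
                actor.update(batch, critic)   # policy improvement: maximize Q under the actor

In offline RL, step 1 disappears entirely: the replay buffer is fixed up front and never grows.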

6 of 15

Offline RL is challenging

  • Fujimoto et al. studied applying standard off-policy RL algorithms to offline RL

  • Findings: standard methods unsuitable for offline RL setting
    • The policy exploits errors in the Q function at out-of-distribution (OOD) actions (states in the backup always come from the dataset, but the queried actions need not)
    • Learned Q values massively overestimate the true Q values (see the bootstrapped target below)
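The failure mechanism, written out (standard TD-target notation; not taken from the slide):

    % Bootstrapped target used to train Q_\theta:
    y(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q_{\theta}(s', a') \right]

Offline, the policy \pi can place probability mass on actions a' far outside the dataset; errors in Q_\theta at those actions are never corrected by new experience, and because the actor is trained to maximize Q, it actively seeks out exactly the overestimated ones.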


(Figure: learning from a fixed dataset, with no online data collection)

Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

7 of 15

Addressing distributional shift

  • Policy constraint methods: keep the learned policy close to the behavior policy so it can’t exploit Q values at unseen actions

  • Policy penalty methods: a penalty is incorporated into the reward / Q values to discourage actions that deviate from the behavior policy (schematic forms of both families below)

  • Uncertainty-based methods: prevent policy from exploiting “uncertain” actions in the model / Q function
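Schematically, the first two families amount to the following (following the form used in the Levine et al. tutorial; D(\cdot,\cdot) is some divergence such as a KL divergence, and \alpha, \epsilon are hyperparameters):

    % Policy constraint:
    \max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} Q(s, a) \right]
    \quad \text{s.t.} \quad D\!\left(\pi(\cdot \mid s),\, \pi_{\beta}(\cdot \mid s)\right) \le \epsilon

    % Policy penalty (constraint folded into the reward / Q targets):
    \tilde{r}(s, a) \;=\; r(s, a) \;-\; \alpha\, D\!\left(\pi(\cdot \mid s),\, \pi_{\beta}(\cdot \mid s)\right)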


Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020

8 of 15

Batch Constrained Q-Learning (BCQ)

  • A policy constraint method: in the policy improvement phase, only consider actions within the support of the dataset
  • The constraint is enforced by explicitly modeling the behavior policy: learn a conditional VAE (cVAE) over the dataset's actions
  • Policy improvement: take the best “perturbed” action among samples from the cVAE (sketched below)


Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

“Fix” suboptimal actions from behavior policy with learned perturbation model
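A rough sketch of the resulting action-selection rule (hypothetical names; the cVAE decoder is assumed to sample its latent internally, and the torch-style modules are stand-ins, not the authors' reference implementation):

    import torch

    def bcq_select_action(state, vae_decoder, perturbation_net, q_net,
                          num_candidates=10, max_perturbation=0.05):
        """Pick an action near the data support: sample candidates from the
        learned behavior-policy cVAE, nudge them with the perturbation model,
        and keep the candidate the Q function scores highest."""
        # Tile the state so all candidates can be scored in one batch.
        states = state.unsqueeze(0).repeat(num_candidates, 1)

        # 1. Candidates come from the cVAE model of the behavior policy,
        #    so they stay (approximately) within the dataset's support.
        candidates = vae_decoder(states)

        # 2. A small learned perturbation can "fix" slightly suboptimal
        #    behavior-policy actions, clipped to a limited range.
        perturbed = candidates + perturbation_net(states, candidates).clamp(
            -max_perturbation, max_perturbation)

        # 3. Greedy selection under the learned Q function.
        best = q_net(states, perturbed).argmax()
        return perturbed[best]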

9 of 15

Advantage Weighted Actor Critic (AWAC)

  • A policy constraint method with an implicit constraint: it avoids explicitly modeling the behavior policy, making it simpler and, empirically, effective (the actor update below makes this concrete)


Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020

(concurrent work) Wang et al., “Critic Regularized Regression”, NeurIPS 2020

Weighted Behavior Cloning!
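The actor update behind the “weighted behavior cloning” label is, up to notation (\lambda is a temperature, A^{\pi_k} the critic's advantage):

    \pi_{k+1} \;=\; \arg\max_{\pi} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ \log \pi(a \mid s)\, \exp\!\left( \tfrac{1}{\lambda} A^{\pi_k}(s, a) \right) \right]

This is supervised behavior cloning on the dataset's actions, with each sample reweighted by its exponentiated advantage, so the constraint toward the behavior policy never has to be modeled explicitly.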

10 of 15

Conservative Q-Learning (CQL)

  • Rather than dealing with policy constraints and OOD actions, directly address the Q value overestimation issue
  • Learn Q values that are guaranteed to lower bound true Q values


(Annotated objective, written out below: a standard Bellman update term, a term that pushes down Q values for the current policy, and a term that pushes up Q values for behavior-policy / dataset actions)
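The three annotations correspond to the three terms of the CQL objective, which (modulo notation and the exact choice of regularizer) reads:

    \min_{Q} \;\; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q(s, a) \right] \;-\; \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ Q(s, a) \right] \right)
    \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\!\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s, a) \right)^{2} \right]

The first term pushes Q down at actions the current policy prefers, the second pushes Q up at actions actually present in the dataset, and the last is the usual Bellman-error regression; together they produce conservative (lower-bounding) value estimates.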

Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020

11 of 15

An empirical study of Offline RL

  • Stitching together datasets with offline RL for robotic manipulation tasks
  • Two datasets: (1) prior dataset with task-agnostic behavior and no rewards, and (2) task dataset with sparse rewards but limited state coverage
  • Objective: solve tasks from new initial states by using the prior dataset to “extend” the set of initial states
  • Reward signal from the task dataset propagates into the prior data through Bellman backups, since the two datasets overlap in some states


Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020

12 of 15

An empirical study of Offline RL (cont.)

(Results figures from Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020)

13 of 15

Should you use Offline RL or BC?


Offline RL

  + Can outperform the behavior policy
  + Amenable to compositionality: implicitly “stitching” behaviors from unrelated experiences
  - Requires rewards
  - Difficulty with non-Markovian data
  - Limited by the quality of the dataset: does not address unseen states

Behavior Cloning (BC)

  + Simple: fewer hyperparameters and model complexities
  + Can outperform current Offline RL methods even with mixed-quality data [1]
  - Relies on somewhat optimal data
  - More susceptible to covariate shift?

[1] Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021

14 of 15

Challenges to address going forward

  • Reward annotation
  • Addressing state mismatch as well as action mismatch
  • Handling non-Markovian, multi-modal datasets
  • Scaling up: how to leverage significantly larger datasets spanning a huge diversity of environments and tasks?


15 of 15

Relevant References

Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020

Fu et al., “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, arXiv 2020

Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020

Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020

Wang et al., “Critic Regularized Regression”, NeurIPS 2020

Yu et al., “COMBO: Conservative Offline Model-Based Policy Optimization”, NeurIPS 2021

Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020

Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021

Cabi et al., “Scaling data-driven robotics with reward sketching and batch reinforcement learning”, arXiv 2019

Zolna et al., “Offline Learning from Demonstrations and Unlabeled Experience”, arXiv 2020

Ajay et al., “OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning”, ICLR 2021

Chebotar et al., “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills”, arXiv 2021

Mandlekar et al., “IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data”, ICRA 2020

Kostrikov et al., “Offline Reinforcement Learning with Implicit Q-Learning”, arXiv 2021
