1 of 15

Offline Reinforcement Learning for Robotics

Soroush Nasiriany

Nov 3, 2021

UT Robot Learning Reading Group


2 of 15

Robot learning from offline data

  • Enabling robots to leverage existing datasets to learn new tasks more efficiently and effectively
  • Types of prior data:
    • Expert human demonstrations
    • Random / scripted policy data
    • Task-agnostic, aka “play” data
    • Data from different (but related) tasks
    • “In the wild” data, e.g., YouTube videos


3 of 15

Why learn from offline data

  • Data-driven robot learning has emerged as a promising approach
  • But…data collection is a major bottleneck in robotics
    • IL (imitation learning): collecting demonstrations is tedious
    • RL: exploration burden
    • Real robots: hardware failures and safety
  • What about simulators?
    • Still need to provide demos or design reward functions
    • Content gap, compute limits, sim2real gap
    • Even without these issues, learning from scratch is naïve…
  • Make efficient use of prior data, don’t throw away anything!


4 of 15

This tutorial: Offline RL

  • Reinforcement Learning without online interaction
  • Instead, the agent has access to an offline dataset of transition tuples collected by a behavior policy
  • Objective: learn a policy that maximizes the expected sum of rewards, using only the offline dataset (formalized below)
  • Primary application: make use of highly suboptimal data (e.g., play data, noisy demonstrations, expert demos from other tasks)
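For concreteness, a minimal formal sketch of this objective (standard notation from the offline RL literature, assuming a dataset \mathcal{D} collected by a behavior policy \pi_\beta; none of these symbols appear on the slide):

    % Optimize the usual RL objective, but learn only from the fixed dataset
    % \mathcal{D} = \{(s_i, a_i, r_i, s'_i)\} collected by the behavior policy \pi_\beta,
    % with no further environment interaction.
    \max_{\pi} \; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]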


5 of 15

A slight detour: Online RL

  • Online, aka “standard” RL: agent alternates between data collection and model updates
  • Off-policy actor-critic methods are popular today (SAC, TD3, etc.)
    • actor-critic: approximate policy iteration (the critic evaluates the current policy, the actor improves it)
    • off-policy: transitions do not need to come from the current policy


Off-policy: replay buffer updated with new data over time
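To make the loop concrete, here is a minimal sketch of a generic off-policy actor-critic training loop (pseudocode-level Python; env, actor, critic and their methods are hypothetical duck-typed interfaces, not any particular library's API):

    import random

    def train_off_policy_actor_critic(env, actor, critic, num_iterations,
                                      steps_per_iter, grad_steps, batch_size):
        """Alternate between collecting data with the current policy and
        updating the actor/critic on minibatches from a replay buffer."""
        replay_buffer = []
        state = env.reset()

        for _ in range(num_iterations):
            # 1. Data collection: roll out the current policy, store transitions.
            for _ in range(steps_per_iter):
                action = actor.sample(state)
                next_state, reward, done = env.step(action)
                replay_buffer.append((state, action, reward, next_state, done))
                state = env.reset() if done else next_state

            # 2. Model updates on minibatches that may come from older policies;
            #    reusing stale transitions is what makes the method off-policy.
            for _ in range(grad_steps):
                batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
                critic.update(batch)          # policy evaluation: fit Q to Bellman targets
                actor.update(batch, critic)   # policy improvement: maximize Q under the actor

In offline RL, step 1 disappears entirely: the replay buffer is fixed up front and never grows.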

6 of 15

Offline RL is challenging

  • Fujimoto et al. studied applying standard off-policy RL algorithms to offline RL

  • Findings: standard methods unsuitable for offline RL setting
    • The policy exploits errors in the Q function at out-of-distribution (OOD) actions (states in the backup always come from the dataset, but the queried actions need not)
    • Learned Q values massively overestimate the true Q values (see the bootstrapped target below)
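The failure mechanism, written out (standard TD-target notation; not taken from the slide):

    % Bootstrapped target used to train Q_\theta:
    y(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q_{\theta}(s', a') \right]

Offline, the policy \pi can place probability mass on actions a' far outside the dataset; errors in Q_\theta at those actions are never corrected by new experience, and because the actor is trained to maximize Q, it actively seeks out exactly the overestimated ones.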


(Figure: learning from a fixed dataset, with no online data collection)

Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

7 of 15

Addressing distributional shift

  • Policy constraint methods: keep the learned policy close to the behavior policy so it can’t exploit Q values at unseen actions

  • Policy penalty methods: a penalty is incorporated into the reward / Q values to discourage actions that deviate from the behavior policy (schematic forms of both families below)

  • Uncertainty-based methods: prevent policy from exploiting “uncertain” actions in the model / Q function
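Schematically, the first two families amount to the following (following the form used in the Levine et al. tutorial; D(\cdot,\cdot) is some divergence such as a KL divergence, and \alpha, \epsilon are hyperparameters):

    % Policy constraint:
    \max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} Q(s, a) \right]
    \quad \text{s.t.} \quad D\!\left(\pi(\cdot \mid s),\, \pi_{\beta}(\cdot \mid s)\right) \le \epsilon

    % Policy penalty (constraint folded into the reward / Q targets):
    \tilde{r}(s, a) \;=\; r(s, a) \;-\; \alpha\, D\!\left(\pi(\cdot \mid s),\, \pi_{\beta}(\cdot \mid s)\right)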


Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020

8 of 15

Batch Constrained Q-Learning (BCQ)

  • A policy constraint method: in the policy improvement phase, only consider actions within the support of the dataset
  • The constraint is enforced by explicitly modeling the behavior policy: learn a conditional VAE (cVAE) over the dataset's actions
  • Policy improvement: take the best “perturbed” action among samples from the cVAE (sketched below)


Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

“Fix” suboptimal actions from behavior policy with learned perturbation model
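A rough sketch of the resulting action-selection rule (hypothetical names; the cVAE decoder is assumed to sample its latent internally, and the torch-style modules are stand-ins, not the authors' reference implementation):

    import torch

    def bcq_select_action(state, vae_decoder, perturbation_net, q_net,
                          num_candidates=10, max_perturbation=0.05):
        """Pick an action near the data support: sample candidates from the
        learned behavior-policy cVAE, nudge them with the perturbation model,
        and keep the candidate the Q function scores highest."""
        # Tile the state so all candidates can be scored in one batch.
        states = state.unsqueeze(0).repeat(num_candidates, 1)

        # 1. Candidates come from the cVAE model of the behavior policy,
        #    so they stay (approximately) within the dataset's support.
        candidates = vae_decoder(states)

        # 2. A small learned perturbation can "fix" slightly suboptimal
        #    behavior-policy actions, clipped to a limited range.
        perturbed = candidates + perturbation_net(states, candidates).clamp(
            -max_perturbation, max_perturbation)

        # 3. Greedy selection under the learned Q function.
        best = q_net(states, perturbed).argmax()
        return perturbed[best]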

9 of 15

Advantage Weighted Actor Critic (AWAC)

  • A policy constraint method with an implicit constraint: it avoids explicitly modeling the behavior policy, making it simpler and, empirically, effective (the actor update below makes this concrete)


Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020

(concurrent work) Wang et al., “Critic Regularized Regression”, NeurIPS 2020

Weighted Behavior Cloning!
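The actor update behind the “weighted behavior cloning” label is, up to notation (\lambda is a temperature, A^{\pi_k} the critic's advantage):

    \pi_{k+1} \;=\; \arg\max_{\pi} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ \log \pi(a \mid s)\, \exp\!\left( \tfrac{1}{\lambda} A^{\pi_k}(s, a) \right) \right]

This is supervised behavior cloning on the dataset's actions, with each sample reweighted by its exponentiated advantage, so the constraint toward the behavior policy never has to be modeled explicitly.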

10 of 15

Conservative Q-Learning (CQL)

  • Rather than dealing with policy constraints and OOD actions, directly address the Q value overestimation issue
  • Learn Q values that are guaranteed to lower bound true Q values


(Annotated objective, written out below: a standard Bellman update term, a term that pushes down Q values for the current policy, and a term that pushes up Q values for behavior-policy / dataset actions)
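The three annotations correspond to the three terms of the CQL objective, which (modulo notation and the exact choice of regularizer) reads:

    \min_{Q} \;\; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q(s, a) \right] \;-\; \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ Q(s, a) \right] \right)
    \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\!\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s, a) \right)^{2} \right]

The first term pushes Q down at actions the current policy prefers, the second pushes Q up at actions actually present in the dataset, and the last is the usual Bellman-error regression; together they produce conservative (lower-bounding) value estimates.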

Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020

11 of 15

An empirical study of Offline RL

  • Stitching together datasets with offline RL for robotic manipulation tasks
  • Two datasets: (1) prior dataset with task-agnostic behavior and no rewards, and (2) task dataset with sparse rewards but limited state coverage
  • Objective: solve tasks from new initial states by using the prior dataset to “extend” the set of initial states
  • Reward signal from the task dataset propagates into the prior data through Bellman backups, since the two datasets overlap in some states


Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020

12 of 15

An empirical study of Offline RL (cont.)

(Results figures from Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020)

13 of 15

Should you use Offline RL or BC?


Offline RL

  + Can outperform the behavior policy
  + Amenable to compositionality: implicitly “stitching” behaviors from unrelated experiences
  - Requires rewards
  - Difficulty with non-Markovian data
  - Limited by the quality of the dataset: does not address unseen states

Behavior Cloning (BC)

  + Simple: fewer hyperparameters and model complexities
  + Can outperform current Offline RL methods even with mixed-quality data [1]
  - Relies on somewhat optimal data
  - More susceptible to covariate shift?

[1] Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021

14 of 15

Challenges to address going forward

  • Reward annotation
  • Addressing state mismatch as well as action mismatch
  • Handling non-Markovian, multi-modal datasets
  • Scaling up: how to leverage significantly larger datasets spanning a huge diversity of environments and tasks?


15 of 15

Relevant References

Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020

Fu et al., “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, arXiv 2020

Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019

Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020

Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020

Wang et al., “Critic Regularized Regression”, NeurIPS 2020

Yu et al., “COMBO: Conservative Offline Model-Based Policy Optimization”, NeurIPS 2021

Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020

Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021

Cabi et al., “Scaling data-driven robotics with reward sketching and batch reinforcement learning”, arXiv 2019

Zolna et al., “Offline Learning from Demonstrations and Unlabeled Experience”, arXiv 2020

Ajay et al., “OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning”, ICLR 2021

Chebotar et al., “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills”, arXiv 2021

Mandlekar et al., “IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data”, ICRA 2020

Kostrikov et al., “Offline Reinforcement Learning with Implicit Q-Learning”, arXiv 2021
