Offline Reinforcement Learning for Robotics
Soroush Nasiriany
Nov 3, 2021
UT Robot Learning Reading Group
Robot learning from offline data
Why learn from offline data?
This tutorial: Offline RL
A slight detour: Online RL
Off-policy: replay buffer updated with new data over time
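A minimal sketch of the distinction, using a toy replay buffer (the class and names below are illustrative, not tied to any particular codebase): off-policy RL keeps adding freshly collected transitions while training, whereas offline RL fills the buffer once and never updates it.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Online off-policy RL: the agent keeps calling buf.add(...) with newly
# collected transitions while it trains on sampled minibatches.
# Offline RL: buf is filled once from a logged dataset and never grows,
# so the agent cannot gather corrective experience for its own mistakes.
buf = ReplayBuffer()
buf.add((0.0, 1, 0.5, 0.1, False))  # toy transition
batch = buf.sample(batch_size=32)
```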
Offline RL is challenging
Fixed dataset: no further interaction to correct value errors on out-of-distribution actions (extrapolation error)
Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019
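Spelled out (a standard restatement of the Q-learning target, not this paper's exact notation), the issue is that the bootstrapped target maximizes over all actions, including ones the fixed dataset never contains:

```latex
% Bellman target computed from a fixed dataset D
y(s, a) = r + \gamma \max_{a'} Q_\theta(s', a')
% The max ranges over actions a' that may have no support in D, where
% Q_\theta(s', a') is pure extrapolation; without further interaction,
% these errors persist and compound through bootstrapping.
```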
Addressing distributional shift
Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020
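One generic recipe surveyed in the tutorial, written schematically (the divergence D, threshold \epsilon, and behavior policy \pi_\beta are placeholders that individual methods instantiate differently): keep the learned policy close to the policy that generated the data.

```latex
% Schematic policy-constraint formulation of offline RL
\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ Q(s, a) \big]
\quad \text{subject to} \quad
D\big( \pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s) \big) \le \epsilon
% where \pi_\beta is the (typically unknown) behavior policy and D is a
% divergence such as the KL divergence.
```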
Batch Constrained Q-Learning (BCQ)
Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019
“Fix” suboptimal actions from the behavior policy with a learned perturbation model
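A minimal sketch of how a BCQ-style policy acts, assuming three pre-trained callables introduced here for illustration (decode_actions, perturb_net, q_net); it mirrors the “generate behavior-like candidates, perturb them, pick the best by Q” idea rather than the authors' released code.

```python
import torch

def bcq_select_action(state, decode_actions, perturb_net, q_net,
                      n_candidates=10, phi=0.05):
    """BCQ-style action selection (sketch).

    decode_actions(states) -> candidate actions near the behavior policy
                              (e.g. from a VAE decoder), shape (N, act_dim)
    perturb_net(states, actions) -> unbounded correction, shape (N, act_dim)
    q_net(states, actions)       -> Q-values, shape (N, 1) or (N,)
    """
    # Score several behavior-like candidates for the same state.
    states = state.unsqueeze(0).repeat(n_candidates, 1)

    # 1) Sample candidate actions that stay on the dataset's support.
    actions = decode_actions(states)

    # 2) "Fix" mildly suboptimal candidates with a small, bounded perturbation.
    actions = actions + phi * torch.tanh(perturb_net(states, actions))

    # 3) Act greedily among the perturbed candidates.
    q_values = q_net(states, actions).reshape(n_candidates)
    return actions[q_values.argmax()]
```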
Advantage Weighted Actor Critic (AWAC)
Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020
(concurrent work) Wang et al., “Critic Regularized Regression”, NeurIPS 2020
Weighted Behavior Cloning!
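The “weighted behavior cloning” view in one sketch, assuming a policy that returns a torch distribution and a critic q_net returning per-sample Q-values (both hypothetical names, not the authors' code); the baseline is formed by sampling actions from the current policy.

```python
import torch

def awac_actor_loss(policy, q_net, states, actions, lam=1.0):
    """Advantage-weighted behavior cloning (sketch of the AWAC actor update).

    policy(states)         -> a torch.distributions object over actions
    q_net(states, actions) -> Q-values of shape (batch,)
    lam                    -> temperature; smaller values favor high-advantage actions
    """
    dist = policy(states)
    with torch.no_grad():
        # Advantage of the dataset action over the current policy's own action.
        baseline = q_net(states, dist.sample())
        advantage = q_net(states, actions) - baseline
        # Exponentiated advantage: good dataset actions get large weights,
        # poor ones get weights near zero (implementations often clamp this).
        weights = torch.exp(advantage / lam)

    log_prob = dist.log_prob(actions)  # assumed shape (batch,)
    # Maximizing the weighted log-likelihood is weighted behavior cloning.
    return -(weights * log_prob).mean()
```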
Conservative Q-Learning (CQL)
Standard Bellman Update
Push down Q values for current policy
Push up Q values for behavior policy
Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020
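A sketch of the resulting critic loss, with hypothetical q_net, target_q_net, and policy callables; it keeps only the “push down on policy actions, push up on dataset actions” regularizer added to the standard Bellman error (the paper's preferred variant replaces the push-down term with a log-sum-exp over actions).

```python
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, target_q_net, policy, batch, gamma=0.99, alpha=1.0):
    """Conservative Q-learning critic loss (sketch).

    q_net(s, a), target_q_net(s, a) -> Q-values of shape (batch,)
    policy(s)                       -> a torch.distributions object over actions
    """
    s, a, r, s2, done = batch  # tensors sampled from the fixed dataset

    # --- Standard Bellman update ---
    with torch.no_grad():
        a2 = policy(s2).sample()
        target = r + gamma * (1.0 - done) * target_q_net(s2, a2)
    bellman_error = F.mse_loss(q_net(s, a), target)

    # --- Conservative regularizer ---
    pi_actions = policy(s).sample()
    q_pi = q_net(s, pi_actions)   # pushed DOWN: actions from the current policy
    q_data = q_net(s, a)          # pushed UP: actions actually in the dataset
    conservative_term = (q_pi - q_data).mean()

    return bellman_error + alpha * conservative_term
```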
An empirical study of Offline RL
Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020
Should you use Offline RL or BC?
| Offline RL | Behavior Cloning (BC) |
| --- | --- |
| + Can outperform behavior policy | + Simple: fewer hyperparameters and model complexities |
| + Amenable to compositionality: implicitly “stitching” behaviors from unrelated experiences | + Can outperform current Offline RL methods even with mixed quality data [1] |
| - Requires rewards | - Relies on somewhat optimal data |
| - Difficulty with non-Markovian data | - More susceptible to covariate shift? |
| - Limited by quality of dataset: does not address unseen states | |
[1] Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021
Challenges to address going forward
Relevant References
Levine et al., “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, arXiv 2020
Fu et al., “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, arXiv 2020
Fujimoto et al., “Off-Policy Deep Reinforcement Learning without Exploration”, ICML 2019
Kumar et al., “Conservative Q-Learning for Offline Reinforcement Learning”, NeurIPS 2020
Nair et al., “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets”, arXiv 2020
Wang et al., “Critic Regularized Regression”, NeurIPS 2020
Yu et al., “COMBO: Conservative Offline Model-Based Policy Optimization”, NeurIPS 2021
Singh et al., “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, CoRL 2020
Mandlekar et al., “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation”, CoRL 2021
Cabi et al., “Scaling data-driven robotics with reward sketching and batch reinforcement learning”, arXiv 2019
Zolna et al., “Offline Learning from Demonstrations and Unlabeled Experience”, arXiv 2020
Ajay et al., “OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning”, ICLR 2021
Chebotar et al., “Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills”, arXiv 2021
Mandlekar et al., “IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data”, ICRA 2020
Kostrikov et al., “Offline Reinforcement Learning with Implicit Q-Learning”, arXiv 2021