Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
Tatsuya Matsushima1*, Hiroki Furuta1*, Yutaka Matsuo1,
Ofir Nachum2, Shixiang Shane Gu2
1The University of Tokyo, 2Google Brain (*Contributed Equally)
Contact: matsushima@weblab.t.u-tokyo.ac.jp
ICLR 2021
Motivation: Reducing Cost & Risks for RL
Impressive success of reinforcement learning (RL) algorithms in sequential decision making
However, deploying new exploratory policies carries potential risks and costs
Related Framework: Offline RL
Offline RL learns policies only from a fixed dataset, without further interaction with the environment
Deployment-Efficiency: A New Metric for RL
Counts the number of distinct policy “deployments” used for data collection
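As a rough illustration of this metric (a sketch with hypothetical numbers, not the paper's experimental settings), the same sample budget can correspond to very different deployment counts:

```python
# Sketch of the deployment-efficiency bookkeeping.
# The deployment counts and batch sizes below are illustrative assumptions.

def total_samples(num_deployments: int, samples_per_deployment: int) -> int:
    """Total environment samples = deployments x samples collected per deployment."""
    return num_deployments * samples_per_deployment

# Typical online RL: the policy is redeployed after every small batch of data,
# so the deployment count grows into the thousands.
print(total_samples(1_000, 1_000))   # 1,000,000 samples over 1,000 deployments

# Deployment-efficient setting: only a handful of large-batch deployments.
print(total_samples(5, 200_000))     # 1,000,000 samples over just 5 deployments
```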
The Problem & Solution
Problem: Low Sample-Efficiency in previous Offline RL methods
Solution: Develop a model-based offline RL method & apply it recursively over multiple deployments!
BREMEN (Behavior-Regularized Model-ENsemble)
We propose BREMEN
Tricks for efficient & stable policy improvement
Trick 1. Model Ensemble
Learn the policy from imaginary rollouts generated by an ensemble of learned dynamics models
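A minimal sketch of this trick, assuming hypothetical `models` (an ensemble of learned forward models) and `policy` callables; the names and interfaces are illustrative, not the released implementation:

```python
import numpy as np

def imaginary_rollout(models, policy, start_states, horizon=5):
    """Roll the policy out inside randomly chosen ensemble members.

    `models` is a list of learned dynamics models s_next = f_i(s, a);
    `policy` maps a batch of states to a batch of actions. Both are
    hypothetical stand-ins used only to illustrate the idea.
    """
    trajectory = []
    s = np.asarray(start_states)
    for _ in range(horizon):
        a = policy(s)
        # Pick one ensemble member per step: switching between models keeps
        # the policy from exploiting any single model's prediction errors.
        f = models[np.random.randint(len(models))]
        s_next = f(s, a)
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory
```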
Trick 2. Conservative Update with Regularization
Estimate the behavior policy from the dataset via behavior cloning (BC)
Apply KL trust-region policy updates, initialized from the BC policy
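A minimal sketch of this trick; `fit_behavior_policy` (supervised behavior cloning) and `trust_region_step` (a TRPO-style update) are hypothetical helpers used only to show the structure of the update, and the step limit and iteration count are illustrative defaults, not the authors' settings:

```python
def conservative_update(dataset, generate_rollouts, kl_limit=0.05, num_iterations=25):
    """Conservative policy improvement via implicit KL regularization.

    1. Fit the behavior policy to the offline dataset with behavior cloning (BC).
    2. Initialize the learned policy from the BC policy, so optimization starts
       close to the data-collection distribution (implicit KL control).
    3. Improve the policy with KL trust-region steps on imaginary rollouts,
       each step constrained by KL(old policy || new policy) <= kl_limit.
    """
    behavior_policy = fit_behavior_policy(dataset)   # supervised BC on (s, a) pairs
    policy = behavior_policy.copy()                  # BC initialization
    for _ in range(num_iterations):
        rollouts = generate_rollouts(policy)         # fresh imaginary rollouts each iteration
        policy = trust_region_step(policy, rollouts, max_kl=kl_limit)
    return policy
```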
Overview of BREMEN
In the limited-deployment setting, recursively apply the offline BREMEN procedure (a sketch of the full loop follows below)
Trick 1. Rollouts from the model ensemble
Trick 2. Initialization with the BC policy
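A sketch of the full recursive procedure under a limited deployment budget, reusing the hypothetical helpers sketched on the previous slides (`collect_data`, `train_dynamics_ensemble`, `imaginary_rollout`, `conservative_update`, and `sample_states` are illustrative names, not the released code):

```python
def bremen(env, initial_policy, num_deployments=5, samples_per_deployment=200_000):
    """Alternate between one real-world deployment and a full offline BREMEN update."""
    policy, dataset = initial_policy, []
    for _ in range(num_deployments):
        # Deploy the current policy once to collect a large batch of real data.
        dataset += collect_data(env, policy, samples_per_deployment)

        # Offline phase: re-fit the dynamics ensemble on all data gathered so far,
        # then improve the policy with BC initialization and trust-region updates
        # on imaginary rollouts (Tricks 1 and 2).
        models = train_dynamics_ensemble(dataset)
        rollout_fn = lambda pi: imaginary_rollout(models, pi, sample_states(dataset))
        policy = conservative_update(dataset, rollout_fn)
    return policy
```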
Experiments
Benchmarking Offline RL
Learn policies from fixed datasets of 1M steps collected at a certain level of cumulative reward
Achieves performance competitive with SoTA model-free algorithms on locomotion tasks
Benchmarking Offline RL (D4RL)
Works comparably well to other model-free/model-based offline methods on the more recent D4RL benchmark [Fu+20]
Sample-Efficiency in Offline RL
Works well with 10-20x smaller datasets!
BREMEN is a stable & sample-efficient offline RL method!
Deployment-Efficiency
Recursively apply the offline RL method across deployments
BREMEN (purple) achieves remarkable performance in limited-deployment settings
Deployment-Efficiency
BREMEN also stably improves the policy in manipulation tasks
Qualitative Results
Locomotion (HalfCheetah)
Rollout behavior after the 1st, 3rd, and 5th deployments
Qualitative Results
Manipulation (FetchReach)
Rollout behavior after the 1st, 6th, and 10th deployments
Ablations: Effectiveness of Implicit KL Control
An explicit KL penalty without BC initialization moves farther away from the last deployed policy
Summary
We propose BREMEN, a deployment-efficient, model-based offline RL method
Future directions
Code & pretrained models