1 of 36

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima1*, Hiroki Furuta1*, Yutaka Matsuo1,

Ofir Nachum2, Shixiang Shane Gu2

1The University of Tokyo, 2Google Brain (*Contributed Equally)

Contact: matsushima@weblab.t.u-tokyo.ac.jp

ICLR 2021

2 of 36

Motivation: Reducing Cost & Risks for RL

Impressive success of reinforcement learning (RL) algorithms in sequential decision making

  • Depends on frequent data collection & policy updates


3 of 36

Motivation: Reducing Cost & Risks for RL

Impressive success of reinforcement learning (RL) algorithms in sequential decision making

  • Depends on frequent data collection & policy updates

However, deploying new exploratory policies involves potential risks and costs

  • e.g., robot control, medicine, and education


4 of 36

Related Framework: Offline RL

Offline RL learns policies only from a fixed dataset


5 of 36

Related Framework: Offline RL

Offline RL learns policies only from a fixed dataset

  • Assumes a dataset collected by a suboptimal policy is already available
  • Usually does not learn from scratch (i.e., from a random policy)


6 of 36

Deployment-Efficiency: A New Metric for RL

Counts the number of “deployments” of the policy for data collection


7 of 36

Deployment-Efficiency: A New Metric for RL

Counts the number of “deployments” of the policy for data collection

  • Even for algorithms with high sample-efficiency, such as SAC, deployment-efficiency may be very low


8 of 36

The Problem & Solution

Problem: Low Sample-Efficiency in previous Offline RL methods

  • Simply repeating offline updates is not sufficient in the limited-deployment setting (starting from a random policy)


9 of 36

The Problem & Solution

Problem: Low Sample-Efficiency in previous Offline RL methods

  • Simply repeating offline updates is not sufficient in the limited-deployment setting (starting from a random policy)

Solution: Develop a model-based offline RL method & apply it recursively!

  • Model-based RL (MBRL) methods are sample-efficient in online RL


10 of 36

BREMEN (Behavior-Regularized Model-ENsemble)

We propose BREMEN

  • A model-based offline RL method
  • Achieves high sample- & deployment-efficiency


11 of 36

BREMEN (Behavior-Regularized Model-ENsemble)

We propose BREMEN

  • A model-based offline RL method
  • Achieves high sample- & deployment-efficiency

Tricks for efficient & stable policy improvement

  1. Policy learning with an ensemble of dynamics models
  2. Conservative policy updates with behavior-policy initialization


12 of 36

Trick 1. Model Ensemble

Learn the policy from imaginary rollouts generated by a dynamics-model ensemble


13 of 36

Trick 1. Model Ensemble

Learn the policy from imaginary rollouts generated by a dynamics-model ensemble

  • Prevents the policy from exploiting model bias
  • Train K dynamics models with different random initializations on the dataset using MSE


14 of 36

Trick 1. Model Ensemble

Learn the policy from imaginary rollouts generated by a dynamics-model ensemble

  • Prevents the policy from exploiting model bias
  • Train K dynamics models with different random initializations on the dataset using MSE

  • During policy learning, randomly pick one dynamics model at every step to roll out the next state (see the sketch below)
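
Below is a minimal sketch of Trick 1, assuming PyTorch and deterministic MLP dynamics models; the function names (make_model, train_ensemble, imaginary_rollout) and the architecture are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

def make_model(obs_dim, act_dim, hidden=256):
    # one ensemble member: predicts the next state from (state, action)
    return nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, obs_dim))

def train_ensemble(models, states, actions, next_states, epochs=100, lr=1e-3):
    # K models differ only in their random initialization; all fit the same dataset with MSE
    inputs = torch.cat([states, actions], dim=-1)
    for model in models:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = ((model(inputs) - next_states) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

def imaginary_rollout(models, policy, state, horizon):
    # at every step a randomly chosen ensemble member generates the next state,
    # which discourages the policy from exploiting any single model's bias
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        model = models[torch.randint(len(models), (1,)).item()]
        next_state = model(torch.cat([state, action], dim=-1))
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory
```

Here policy is any callable mapping a state tensor to an action tensor, and the ensemble would be built as models = [make_model(obs_dim, act_dim) for _ in range(K)].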


15 of 36

Trick 2. Conservative Update with Regularization

Estimate the behavior policy from the dataset via behavior cloning (BC)


16 of 36

Trick 2. Conservative Update with Regularization

Estimate the behavior policy from the dataset via behavior cloning (BC)

KL trust-region policy updates initialized with the BC policy


17 of 36

Trick 2. Conservative Update with Regularization

Estimate the behavior policy from the dataset via behavior cloning (BC)

KL trust-region policy updates initialized with the BC policy

  • Works as implicit KL regularization, in contrast to the explicit penalties on the value or immediate reward used in previous works (see the sketch below)
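
A minimal sketch of Trick 2 under the same PyTorch assumption, with Gaussian policies of a fixed shared std so that the KL between two policies reduces to a squared difference of means. The actual BREMEN update uses a TRPO-style trust-region step; here a simple KL penalty toward the previous iterate stands in for it, and objective is assumed to be a differentiable return estimate on imaginary ensemble rollouts.

```python
import copy
import torch
import torch.nn as nn

def behavior_cloning(obs_dim, act_dim, states, actions, epochs=200, lr=1e-3):
    # estimate the behavior policy (its mean network) from the offline dataset
    pi_beta = nn.Sequential(nn.Linear(obs_dim, 256), nn.Tanh(), nn.Linear(256, act_dim))
    opt = torch.optim.Adam(pi_beta.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((pi_beta(states) - actions) ** 2).mean()  # max likelihood for a fixed std
        opt.zero_grad(); loss.backward(); opt.step()
    return pi_beta

def conservative_updates(pi_beta, objective, states, num_outer=20, inner_steps=10,
                         lr=1e-4, beta=1.0, std=0.3):
    # initialize the learned policy at the BC policy; together with the small
    # KL-regularized steps below this acts as an *implicit* regularization toward
    # the data-collection policy (no penalty is added to the reward or value)
    policy = copy.deepcopy(pi_beta)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(num_outer):
        old_means = policy(states).detach()  # snapshot of the previous iterate
        for _ in range(inner_steps):
            kl = ((policy(states) - old_means) ** 2).sum(-1).mean() / (2 * std ** 2)
            loss = -objective(policy) + beta * kl  # stay close to the previous iterate
            opt.zero_grad(); loss.backward(); opt.step()
    return policy
```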


18 of 36

Overview of BREMEN

In the limited-deployment setting, recursively apply the offline BREMEN procedure (a sketch of the full loop follows below)


  • Trick 1: rollouts from the dynamics-model ensemble
  • Trick 2: policy initialization with the BC policy
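
Putting the two tricks together, a minimal sketch of the recursive deployment loop, reusing the helpers sketched on the Trick 1 / Trick 2 slides. The environment-interaction helpers (init_random_policy, collect_trajectories, to_tensors, imagined_return) are hypothetical placeholders, and gym-style observation/action spaces are assumed.

```python
def bremen_deployment_loop(env, num_deployments, steps_per_deployment, K=5):
    obs_dim = env.observation_space.shape[0]  # gym-style spaces assumed
    act_dim = env.action_space.shape[0]
    dataset = []
    policy = init_random_policy(obs_dim, act_dim)  # hypothetical helper
    for _ in range(num_deployments):  # the quantity that deployment-efficiency counts
        # 1) deploy the current fixed policy once and grow the dataset
        dataset += collect_trajectories(env, policy, steps_per_deployment)  # hypothetical helper
        states, actions, next_states = to_tensors(dataset)  # hypothetical helper
        # 2) Trick 1: re-train the K-model ensemble on all data gathered so far
        models = [make_model(obs_dim, act_dim) for _ in range(K)]
        train_ensemble(models, states, actions, next_states)
        # 3) Trick 2: re-estimate the behavior policy by BC, then run conservative
        #    KL-regularized policy optimization on imaginary ensemble rollouts
        pi_beta = behavior_cloning(obs_dim, act_dim, states, actions)
        policy = conservative_updates(
            pi_beta,
            objective=lambda pi: imagined_return(pi, models, states),  # hypothetical helper
            states=states)
    return policy
```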

19 of 36

Experiments


20 of 36

Benchmarking Offline RL

Learn policies from fixed datasets of 1M steps collected by a behavior policy with a certain cumulative reward

  • Same experimental protocol as the previous work [Wu+19]


21 of 36

Benchmarking Offline RL

Learn policies from fixed datasets of 1M steps collected by a behavior policy with a certain cumulative reward

  • Same experimental protocol as the previous work [Wu+19]

Achieves performance competitive with SoTA model-free algorithms on locomotion tasks


22 of 36

Benchmarking Offline RL (D4RL)

Performs comparably to other model-free/model-based offline methods on the more recent D4RL benchmark [Fu+20]

22

23 of 36

Benchmarking Offline RL (D4RL)

Performs comparably to other model-free/model-based offline methods on the more recent D4RL benchmark [Fu+20]

  • N.B. Scores are normalized by the expert performance for each dataset (the standard D4RL normalization is sketched below)
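
For reference, a sketch of the normalization convention defined by the D4RL benchmark, assuming the standard D4RL definition in which 0 corresponds to a random policy and 100 to the expert for each environment/dataset:

```python
def d4rl_normalized_score(score, random_score, expert_score):
    # D4RL convention: 0 = random policy, 100 = expert policy
    return 100.0 * (score - random_score) / (expert_score - random_score)
```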


24 of 36

Sample-Efficiency in Offline RL

Works well with 10-20x smaller datasets!


25 of 36

Sample-Efficiency in Offline RL

Works well with 10-20x smaller datasets!

  • Previous methods are sometimes unstable and do not even exceed the performance of the dataset


26 of 36

Sample-Efficiency in Offline RL

Works well with 10-20x smaller datasets!

  • Previous methods are sometimes unstable and do not even exceed the performance of the dataset

BREMEN is a stable & sample-efficient offline RL method!


27 of 36

Deployment-Efficiency

Recursively apply the offline RL method

  • Online learning from a randomly collected initial dataset under a limited number of deployments


28 of 36

Deployment-Efficiency

Recursively apply the offline RL method

  • Online learning from a randomly collected initial dataset under a limited number of deployments

BREMEN (purple) achieves remarkable performance in limited-deployment settings


29 of 36

Deployment-Efficiency

BREMEN also stably improves the policy in manipulation tasks


30 of 36

Deployment-Efficiency

BREMEN also stably improves the policy in manipulation tasks

  • Satisfies practical requirements in robotics: sample- & deployment-efficiency


31 of 36

Qualitative Results

Locomotion (HalfCheetah)


[Video frames: HalfCheetah behavior after the 1st, 3rd, and 5th deployments]

32 of 36

Qualitative Results

Manipulation (FetchReach)


[Video frames: FetchReach behavior after the 1st, 6th, and 10th deployments]

33 of 36

Ablations: Effectiveness of Implicit KL Control

An explicit KL penalty w/o BC initialization moves farther away from the last deployed policy

  • α is the coefficient of an explicit KL penalty added to the value
  • This suggests that the implicit regularization is more effective for conservative updates (see the sketch below)
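
For contrast, a sketch (same PyTorch / fixed-std-Gaussian assumptions as in the earlier sketches) of the explicit penalty used in this ablation and in prior work: the KL to the estimated behavior policy is subtracted from the immediate reward (or value) with coefficient α, instead of relying on BC initialization plus trust-region steps.

```python
def kl_penalized_reward(reward, policy, pi_beta, state, alpha, std=0.3):
    # KL(pi || pi_beta) for fixed-std Gaussian policies reduces to a squared mean difference
    kl = ((policy(state) - pi_beta(state)) ** 2).sum(-1) / (2 * std ** 2)
    return reward - alpha * kl  # explicit penalty with coefficient alpha
```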



36 of 36

Summary

We propose BREMEN

  • Model-based offline RL algorithm with high sample-efficiency
  • Also achieves high deployment-efficiency

Future directions

  • Policy verification for safe & efficient data-collection
  • Application to real robots

Code & pretrained models
