1 of 123

Exploration in RL

9:00 - 17:30

Audience: Please sit in front of the center-front screen.

Authors: Please use any poster board on this side

2 of 123

Overview

9:00 - 9:30: Keynote: Doina Precup

9:30 - 10:00: Spotlights

10:00 - 11:00: Poster Session #1

11:00 - 11:30: Speaker: Emo Todorov

11:30 - 12:00: Best Paper Awards

12:00 - 12:30: Speaker: Pieter Abbeel

12:30 - 14:00: Lunch

14:00 - 14:30: Speaker: Raia Hadsell

14:30 - 15:00: Lightning Talks

15:00 - 16:00: Poster Session #2

16:00 - 16:30: Speaker: Martha White

16:30 - 17:30: Panel Discussion

3 of 123

Keynote: Doina Precup

9:00 - 9:30

4 of 123

Spotlights

(5 x 5 min)

9:30 - 10:00

5 of 123

Overcoming Exploration With Play

Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet

6 of 123

Problem:

One robot, many tasks

7 of 123

“Playing” data for training, collected from human teleoperation (video shown at 2.5× speed).

8 of 123

Training Play-LMP

  1. Given unlabeled play data: a replay buffer of unlabeled & unsegmented play videos & actions.
  2. Learn latent plans using self-supervision: a plan proposal network (conditioned on the current and goal states) and a plan recognition network (conditioned on the entire sequence) are matched in the latent plan distribution space via KL-divergence minimization.
  3. Decode the (sampled) latent plan to reconstruct actions: an action decoder, conditioned on the current state, the goal, and the latent plan, maximizes the likelihood of the recorded actions.

[Figure: Play-LMP training diagram.]

9 of 123

18 tasks (for evaluation only)

10 of 123

Examples of successful runs for Play-LMP

Goal

Play-LMP policy

1x

(task: sliding)

11 of 123

Examples of successful runs for Play-LMP

Goal

Play-LMP policy

1x

(task: sweep)

12 of 123

Composing 2 skills: grasp flat + drop in trash

Goals

Play-LMP policy

1x

=

+

13 of 123

8 skills in a row

Goal

Play-LMP policy

1.5x

14 of 123

  • Paper + videos: Learning-from-play.github.io

Thank you!

15 of 123

Optimistic Exploration with Pessimistic Initialisation

Tabish Rashid, Bei Peng, Wendelin Boehmer, Shimon Whiteson

16 of 123

Motivations

  • Optimistic Initialisation is an effective strategy for exploration in tabular RL.
  • Popular model-free Deep RL algorithms take inspiration from the tabular setting.
  • But they DO NOT attempt optimistic initialisation,
  • even though ALL provably efficient model-free algorithms rely on it.

17 of 123

Optimistic Init with Neural Networks

  • Why can’t we do optimistic initialisation with neural networks?
  • For an optimistic initialisation to benefit exploration, the Q-Values for unseen state-action pairs must start high and remain high until they are visited.

18 of 123

Is this bad?

  • Assume a pessimistic initialisation (for a worst-case outlook).
  • For non-negative rewards this is 0.
  • Without optimism we can fail on this simple 1-state, 2-action MDP!

19 of 123

Separating Optimism from Q-Value approximation

  • If we can’t ensure sufficient optimism from our function approximator, let's have a separate source of optimism.
  • Use Q+ during action selection and bootstrapping.

[Figure: Q+ combines the function approximator with a count-based source of optimism; a sketch follows below.]
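A sketch of the augmented Q-values in the deck's notation; the exact bonus form and constants below are an illustrative assumption, not necessarily the paper's precise definition:

Q+(s, a) = Q(s, a) + C / (N(s, a) + 1)^M

Here Q is the (pessimistically initialised) function approximator, N(s, a) is a visit count, and C, M > 0 set the scale and decay of the count-based optimism. Q+ is used both for action selection (argmax over a of Q+(s, a)) and for the bootstrap target (max over a' of Q+(s', a')), so unvisited actions look attractive even though Q itself starts at 0.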

20 of 123

Tabular Regret Bounds

  • Starting from UCB-H [1] in the finite-horizon setting.
  • Pessimistically initialise the Q-Values at 0 (instead of at H).
  • Use Q+ for action selection and bootstrapping.
  • OPIQ: Optimistic Pessimistically Initialised Q-Learning.
  • We can achieve the same regret bounds as UCB-H for M ≥ 1.

[1] Jin, Chi, et al. "Is Q-learning provably efficient?" Advances in Neural Information Processing Systems. 2018.

21 of 123

Scaling to Deep RL

  • OPIQ does not assume an optimistic initialisation.
  • When extending it to Deep RL, we lose fewer crucial parts of the algorithm.
  • Base it upon DQN with pseudo-count based intrinsic motivation.
  • To approximate the counts we use static hashing [2] for its generality and simplicity (see the sketch below).
    • Better results can be achieved with better approximate counting schemes.
  • Use Q+ for action selection and bootstrapping.

[2] Tang, Haoran, et al. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems. 2017.
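A minimal sketch of SimHash-style static hashing counts in the spirit of [2]; the feature dimension, code length k, and the exact way the count feeds into the bonus are illustrative assumptions.

import numpy as np

class HashCounter:
    """Approximate visit counts via a fixed random projection (static hashing)."""

    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed projection matrix
        self.counts = {}

    def _code(self, state):
        # k-bit binary code: sign pattern of the projected state features.
        return tuple((self.A @ np.asarray(state, dtype=np.float64) > 0).astype(np.int8))

    def update(self, state):
        code = self._code(state)
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.counts[code]

    def count(self, state):
        return self.counts.get(self._code(state), 0)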

22 of 123

Maze Results

23 of 123

Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration

Jingwei Zhang*, Niklas Wetzel*, Nicolai Dorka*, Joschka Boedecker, Wolfram Burgard

24 of 123

Sparse Reward Reinforcement Learning

  • Needs less reward shaping, which:
    • Can induce unintended behaviour
    • Is difficult to define for many real-world tasks (e.g. in robotics)
    • Needs a lot of supervision (e.g. setting up motion capture)
  • Less informative feedback
    • Structured exploration is needed

25 of 123

Intrinsic Rewards

  • The agent generates its own intrinsic rewards in order to explore its environment in a structured way
  • Can solve tasks that random exploration has a near-zero chance of solving

26 of 123

Limitation of Current Intrinsic Reward Formulations

  • No temporally extended exploration: most approaches use only local information (e.g. one-step prediction error)
  • Mixture policy: the intrinsic reward is added as a reward bonus, which leads to a mixture policy that acts greedily neither with respect to exploration nor with respect to extrinsic reward maximization

27 of 123

Successor Feature Control

  • Idea: Use successor features to take temporally extended information into account
  • Successor Features (standard definition recalled below):

φ: some feature embedding of the state

  • Intrinsic Reward: derived from the successor features (formula in the paper)
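For reference, the standard successor-feature definition this builds on (the paper's exact intrinsic-reward formula is not reproduced here):

ψ^π(s) = E_π[ Σ_{t≥0} γ^t φ(s_t) | s_0 = s ]

Because ψ^π aggregates discounted future features rather than a single next observation, an intrinsic reward derived from it (for example, one based on how much the successor features change along the trajectory) carries temporally extended information instead of one-step novelty.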

28 of 123

Scheduled Intrinsic Drive

  • Idea: Decouple exploration from extrinsic reward maximization
  • Hierarchical approach
  • Learn multiple policies with different reward functions
    • One policy maximizes extrinsic reward
    • One policy maximizes intrinsic reward
  • Several times during each episode, schedule which of the policies to follow
  • Train both strategies off-policy with all experiences (see the sketch below)
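A minimal sketch of the scheduling loop under stated assumptions (a fixed number of equally long segments, a uniform scheduler, and abstract policy/buffer interfaces); the paper's actual scheduler and learners may differ.

import random

def run_episode(env, task_policy, explore_policy, replay_buffer,
                num_segments=4, max_steps=400):
    """Switch between the extrinsic and intrinsic policies several times per episode.
    Every transition is stored once and reused to train both policies off-policy."""
    obs = env.reset()
    steps_per_segment = max_steps // num_segments
    for _ in range(num_segments):
        active = random.choice([task_policy, explore_policy])  # scheduling decision
        for _ in range(steps_per_segment):
            action = active.act(obs)
            next_obs, extrinsic_reward, done, info = env.step(action)
            # Shared experience: intrinsic rewards can be (re)computed at training time.
            replay_buffer.add(obs, action, extrinsic_reward, next_obs, done)
            obs = next_obs
            if done:
                return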

29 of 123

Experiments: Doom

FlytrapEscape

MyWayHome

30 of 123

Results

[Figure: results on MyWayHome and FlytrapEscape.]

31 of 123

Thank you for your attention!

Link to updated version of the paper: https://arxiv.org/abs/1903.07400

More experiments at the poster

32 of 123

Generative Exploration and Exploitation

Jiechuan Jiang, Zongqing Lu

33 of 123

The Journey is the Reward: Unsupervised Learning of Influential Trajectories

Jonathan Binas, Sherjil Ozair, Yoshua Bengio

34 of 123

Rather than just discovering new outcomes, learn how to achieve them.

Empowerment: learn controllability by maximizing I(a1, a2, ...; oT).

Instead of considering a single step or a single target observation, consider trajectories: I(a1, a2, …; f(o1, o2, …))
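The trajectory-level objective can be read through the standard mutual-information decomposition (this restates the slide's quantity, not the paper's particular estimator):

I(a1, a2, …; f(o1, o2, …)) = H(f(o1, o2, …)) − H(f(o1, o2, …) | a1, a2, …)

Maximizing it favours action sequences whose effect on the trajectory summary f is both diverse (high entropy) and predictable given the actions (low conditional entropy), i.e. influential trajectories rather than merely novel single observations.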

37 of 123

Partial observability and high-dimensional actions

38 of 123

The Journey is the Reward: Unsupervised Learning of Influential Trajectories

Jonathan Binas, Sherjil Ozair, Yoshua Bengio

39 of 123

Poster Session #1

10:00 - 11:00

40 of 123

Invited Speaker:

Emo Todorov

11:00 - 11:30

41 of 123

Best Paper Awards

11:30 - 12:00

42 of 123

Best Paper Awards

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment. Adrien Ali Taiga, Marc G. Bellemare, Aaron Courville, Liam Fedus, Marlos C. Machado

Simple Regret Minimization for Contextual Bandits. Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott

43 of 123

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment.

Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare

44 of 123

45 of 123

Arcade Learning Environment (ALE)

Bellemare et al., 2013

46 of 123

Deep Q-Networks

Hard exploration games

Mnih et al., 2015

47 of 123

Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)

48 of 123

Recent improvements

  • Counts (Bellemare et al. 2016, Ostrovski et al. 2017, Machado et al. 2018)
  • Parameter Noise (Fortunato et al., Plappert et al.)
  • Thompson Sampling (Osband et al. 2016, Osband et al. 2018)
  • Prediction error (Pathak et al. 2017, Burda et al. 2018)

… and many more!

49 of 123

VentureBeat, 2018

50 of 123

Look back at the progress made in exploration in the ALE.

51 of 123

Exploration method wish list

  • Sample efficient
    • Solves the environment quickly.
  • Robust
    • Should not require tuning for every new environment.
  • Scalable
    • The ALE is a building block towards harder and more complex problems.

52 of 123

Evaluating exploration methods

Comparisons done

  • With different learning algorithms
    • E.g. MC-DQN (value-based) vs. PPO (policy-based)
  • Using different amounts of data (e.g. 200M vs. 2B frames).
  • Evaluating only on hard exploration games.

Reported results are hard to interpret!

53 of 123

Proposal

  • Benchmark recent exploration methods.
  • We focus on bonus-based methods.

54 of 123

Evaluation methodology

  • One agent: Rainbow (Hessel et al. 2017).
  • Hyperparameters of the learning algorithm are kept fixed.
  • Each agent is trained for 200M frames.
  • Agents are trained with sticky actions (Machado et al. 2017).
  • Exploration methods are tuned on Montezuma’s Revenge.

55 of 123

Methods studied

Bonus-based:

  • Pseudo-counts
    • CTS (Bellemare et al. 2016), PixelCNN (Ostrovski et al. 2017)
  • Intrinsic Curiosity Module (ICM; Pathak et al. 2017).
  • Random Network Distillation (RND; Burda et al. 2018).

Baselines:

  • ε-greedy exploration.
  • Noisy Networks (NoisyNets; Fortunato et al. 2017).

56 of 123

Empirical results

57 of 123

58 of 123

Evaluation

59 of 123


60 of 123

Evaluation

61 of 123

Easy exploration games

Atari training set:

Asterix - Seaquest - H.E.R.O - Space Invaders - Beam Rider - Freeway

62 of 123

63 of 123

Takeaways

64 of 123

65 of 123

Joint work with

66 of 123

Thank you for your attention!

67 of 123

Simple Regret Minimization for Contextual Bandits

Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott

68 of 123

Invited Speaker:

Pieter Abbeel

12:00 - 12:30

69 of 123

Lunch

12:30 - 14:00

70 of 123

Invited Speaker: Raia Hadsell

14:00 - 14:30

71 of 123

Lightning Talks

14:30 - 15:00

72 of 123

Format

12 speakers, 2 min each (no questions)

Please advance to the next speaker’s slide.

Please hold applause until end.

73 of 123

Curious iLQR: Resolving Uncertainty in Model-based RL

Sarah Bechtle¹, Akshara Rai², Yixin Lin², Ludovic Righetti¹,³, Franziska Meier²

¹Max Planck Institute for Intelligent Systems

²Facebook AI Research

³New York University

74 of 123

Motivation:

Exploration in Model-based RL might help escape model bias.

Approach:

  • Learn a dynamics model that incorporates uncertainty
  • Include the model uncertainty in the optimization of the trajectory

→ Seeking out actions that resolve uncertainty in the model leads to better performance and better model quality while staying close to the task being solved.

Evaluation:

7-DoF arm in simulation and on hardware, for a free-movement reaching task.

75 of 123

An Empirical and Conceptual Categorization of Value-based Exploration Methods

Niko Yasui¹, Sungsu Lim¹, Cameron Linke¹, Adam White¹,², Martha White¹

¹University of Alberta

²DeepMind

DQN:

  • Target network
  • Experience replay
  • Epsilon-greedy

etc...

76 of 123

Conceptual Properties

[Figure: two conceptual approaches relative to Q*: (approximate) upper confidence bounds and Thompson sampling.]

77 of 123

Simplifying Exploration

  • Linear Q-learning agent with fixed features

  • Environments
    • Misleading reward
    • High-variance reward
    • Large state-action space
    • Repulsive transition dynamics

78 of 123

Some takeaways

  • ε-greedy seems to do well when
    • Exploration is safe
    • Rewards are placed along a path near the optimal policy

  • Both optimism and posterior sampling appear to exhaustively sweep large state-action spaces
  • Need to understand interactions between representation learning and exploration

79 of 123

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Vitchyr H. Pong*, Murtaza Dalal*, Steven Lin*, Ashvin V. Nair, Shikhar Bahl, Sergey Levine (UC Berkeley)

80 of 123

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Motivation: Formal objective for autonomous exploration and controllability

Idea: Want agent to set diverse goals for itself

Method: Learn a maximum-entropy goal distribution even in unknown, high-dimensional state spaces (see the sketch below)

Evaluation: Image-based robot manipulation tasks

code: https://github.com/vitchyr/rlkit/
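A minimal sketch of one way to skew goal sampling toward higher entropy, in the spirit of the slide; the exponent alpha, the density model, and the resampling scheme are assumptions here rather than the authors' exact procedure.

import numpy as np

def skewed_goal_sample(states, log_prob_fn, alpha=-0.5, n_goals=64, rng=None):
    """Resample visited states as goals with weights proportional to p(s)**alpha.

    With alpha < 0, rare states are over-sampled, pushing the effective goal
    distribution toward higher entropy (broader coverage of the state space).
    """
    rng = np.random.default_rng() if rng is None else rng
    log_p = np.array([log_prob_fn(s) for s in states])  # density under e.g. a VAE
    log_w = alpha * log_p
    weights = np.exp(log_w - log_w.max())               # numerically stable normalisation
    weights /= weights.sum()
    idx = rng.choice(len(states), size=n_goals, p=weights)
    return [states[i] for i in idx]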

81 of 123

Optimistic Proximal Policy Optimization

Takahisa Imagawa (AIST)*

Takuya Hiraoka (NEC)

Yoshimasa Tsuruoka (The University of Tokyo)

82 of 123

Optimistic PPO: Motivations

Uncertainty Bellman exploration (UBE) [1] is one of the methods to alleviate the sparse-reward problem:

  • it evaluates the policy optimistically, by the amount of uncertainty in the evaluation
  • it has a more solid theoretical background than simply adding an intrinsic reward, as in random network distillation (RND) [2]
  • it was applied to SARSA; however, there are more sophisticated algorithms, e.g. proximal policy optimization (PPO) [3]

[1] Brendan O’Donoghue et al. The uncertainty Bellman equation and exploration.
[2] Yuri Burda et al. Exploration by random network distillation.
[3] John Schulman et al. Proximal policy optimization algorithms.

Thus, we apply UBE to PPO and propose a new algorithm, Optimistic PPO.
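For reference, the uncertainty Bellman equation of [1], roughly (the exact local-uncertainty term and constants are in the paper):

u(s, a) = ν(s, a) + γ² Σ_{s', a'} P(s' | s, a) π(a' | s') u(s', a')

where ν(s, a) is a local uncertainty term. Solving this gives per state-action uncertainty estimates u, and the policy is then evaluated optimistically in proportion to √u(s, a), instead of adding an intrinsic reward to the environment reward.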

83 of 123

Optimistic PPO: Experimental Results

[Figure: learning curves; the proposed method reaches a high return.]

84 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Partially observable predator-prey task
  • Independent Q-learning is slow

85 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Partially observable predator-prey task
  • Independent Q-learning is slow
  • Intrinsic reward introduces instability

86 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Introduce an additional Exploration Agent
  • Training: same instability
  • Deployed: fast and stable

87 of 123

Parameterized Exploration

Jesse Clifton, Lili Wu, Eric Laber (North Carolina State University)

  • Problem: Exploration over a finite horizon.
  • Intuitively, the agent should explore less and less as it approaches the end of the horizon.
  • Idea: Parameterize the exploration rate as a decreasing function of time t, and estimate its parameters using model-based rollouts.

88 of 123

  • Our exploration rate function

  • At each time t, estimate its parameters by solving

  • At time t, we can form an estimated environment.
  • In this estimated environment, different parameters give different exploration-rate sequences; we then choose the best one (see the sketch below).
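A minimal sketch under illustrative assumptions (a two-parameter decaying ε schedule and a generic rollout simulator of the estimated environment); the paper's actual functional form and tuning objective may differ.

import numpy as np

def epsilon(t, theta):
    """Exploration rate as a decreasing function of time, e.g. eps_t = theta0 / (1 + theta1 * t)."""
    theta0, theta1 = theta
    return theta0 / (1.0 + theta1 * t)

def choose_schedule(candidate_thetas, simulate_return, horizon_left, n_rollouts=20):
    """Pick the parameters whose exploration-rate sequence yields the highest
    average return in rollouts of the estimated environment."""
    best_theta, best_value = None, -np.inf
    for theta in candidate_thetas:
        eps_seq = [epsilon(t, theta) for t in range(horizon_left)]
        value = np.mean([simulate_return(eps_seq) for _ in range(n_rollouts)])
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta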

89 of 123

Evidence (Normal MAB)

More evidence for Bernoulli MABs, contextual bandits, and MDPs in the paper ...

90 of 123

Efficient Exploration in Side-Scrolling Video Games with Trajectory Replay

I-Huan Chiang*, Chung-Ming Huang,

Nien-Hu Cheng, Hsin-Yu Liu, Shi-Chun Tsai

(National Chiao Tung University,

Hsinchu, Taiwan)

91 of 123

Side-Scrolling Game

92 of 123

Scenarios in Side-Scrolling Games

Spring system

Moving platforms

Loop

Seesaw

93 of 123

[Figure: game level with Start and Goal positions.]

94 of 123


95 of 123

Penalty

96 of 123

TRM & TOM

[Figure: penalty scale, from high to low.]

97 of 123

Idea

98 of 123

Results

99 of 123

Thank you for your attention

thumbg12856.cs06g@g2.nctu.edu.tw

100 of 123

Hypothesis Driven Exploration for Deep Reinforcement Learning

Caleb Chuck (UT Austin)*; Supawit Chockchowwat (UT Austin);

Scott Niekum (UT Austin)

Explore by proposing and evaluating hypotheses about controllable object interactions, starting from raw pixels and raw actions

101 of 123

102 of 123

Learning latent state representation for speeding up exploration

Giulia Vezzani (Istituto Italiano di Tecnologia)*; Abhishek Gupta (UC Berkeley); Lorenzo Natale (Istituto Italiano di Tecnologia); Pieter Abbeel (UC Berkeley)

  • Representation learning applied to exploration
  • Prior experience is used to learn effective state representations
  • Entropy-based exploration method
  • The learned representation helps narrow the search space.

103 of 123

Learning latent state representation for speeding up exploration

Maximum-entropy bonus for exploration

Multi-headed reward regression

104 of 123

Epistemic Risk-Sensitive Reinforcement Learning

Hannes Eriksson (Chalmers)*; Christos Dimitrakakis (Chalmers University of Technology)

105 of 123

Utility functions

Epistemic risk

Risk that arises due to uncertainty about the model

Objective

106 of 123

Key results

  • We propose a novel framework for dealing with epistemic risk, with a tuneable risk parameter

  • We introduce three ways of acting under epistemic uncertainty (the standard risk notions involved are recalled below):
    • Extended Value Iteration under epistemic risk
    • Epistemic Risk-Sensitive policy gradient
    • Bayesian Epistemic CVaR
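For reference, the standard risk notions these build on (definitions only; the paper's exact epistemic formulations are taken with respect to the posterior over models):

Exponential utility: U_β(X) = −(1/β) · log E[exp(−β·X)], where β > 0 is the tuneable risk parameter and β → 0 recovers the risk-neutral E[X].

CVaR at level α: CVaR_α(X) = E[X | X ≤ VaR_α(X)], the expected outcome over the worst α-fraction of cases.

In the epistemic setting, these expectations range over the posterior distribution of MDP models rather than over within-model randomness.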

107 of 123

Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

Aristide Tossou*; Debabrota Basu; Christos Dimitrakakis

(Chalmers University of Technology)

108 of 123

UCRL-V vs. state-of-the-art UCRL2

UCRL2 (Jaksch et al., 2010) regret:

Lower bound (Jaksch et al., 2010) :

UCRL-V:

Same time complexity and same settings

109 of 123

UCRL-V key ingredients

Variance-based confidence intervals for all subsets of next states (see the note below)

A new episode is started when the average number of states doubled reaches 1
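For context, a note on the empirical Bernstein inequality such variance-based intervals typically build on (stated up to constants): for n i.i.d. samples in [0, 1] with empirical mean X̄ and empirical variance V̂, with probability at least 1 − δ,

|X̄ − E[X]| ≲ √(V̂ · log(1/δ) / n) + log(1/δ) / n

The variance-dependent first term lets the confidence intervals shrink much faster than Hoeffding-style bounds when the observed outcomes have low variance.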

110 of 123

Improved Tree Search for Automatic Program Synthesis

Aran Carmon1, Lior Wolf1,2

1Tel Aviv University

2Facebook AI Research (FAIR)

Using a variant of MCTS that leads to state-of-the-art program synthesis results on two vastly different DSLs.

Multiple contributions:

  1. a modified visit count with shared states
  2. encoding the part of the program that was already executed
  3. a preprocessing procedure for the training dataset

111 of 123

Program Synthesis

Inputs → ?Program? → Outputs

Deepcoder example:
  Input:   [1, -7, -14, 12, 8, 18, -11]
  Program: Filter(<0) → Map(*2) → Sort() → Reverse()
  Output:  [-14, -22, -28]
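A quick sanity check of the Deepcoder example above, using plain Python list operations as stand-ins for the DSL primitives:

xs = [1, -7, -14, 12, 8, 18, -11]
step1 = [x for x in xs if x < 0]   # Filter(<0)  -> [-7, -14, -11]
step2 = [2 * x for x in step1]     # Map(*2)     -> [-14, -28, -22]
step3 = sorted(step2)              # Sort()      -> [-28, -22, -14]
step4 = list(reversed(step3))      # Reverse()   -> [-14, -22, -28]
assert step4 == [-14, -22, -28]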

Karel example:
  Program:
    While (rightIsNotClear) {
      If (markerPresent) {
        pickMarker()
      }
      move()
    }
    putMarker()
    move()
  Input:
    ⟴ ⚐
    ⛝⛝⛝⛝⛝⛝
  Output:
    ⚐⟴
    ⛝⛝⛝⛝⛝⛝

112 of 123

Using the Policy Network with MCTS

Selection score: Score = P / (N + 1), where P is the probability assigned by the policy network and N is the visit count.

[Figure: search tree rooted at Init; the policy network assigns prior probabilities to the children (e.g. 55%, 40%, 2%). A child that has been explored too much (high visit count) or that has a low network probability is scored down, and the highest-scoring child is the best candidate this time.]
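A minimal sketch of this selection rule with visit counts shared across identical states; the node fields and the state-key function are assumptions.

def select_child(children, visit_counts, state_key):
    """Pick the child maximizing Score = P / (N + 1), where P is the policy
    network's prior probability and N is the shared visit count of the child's state."""
    def score(child):
        n = visit_counts.get(state_key(child.state), 0)
        return child.prior / (n + 1)
    best = max(children, key=score)
    key = state_key(best.state)
    visit_counts[key] = visit_counts.get(key, 0) + 1  # shared count, not per-tree-node
    return best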

113 of 123

Partial Encoding: the policy network receives the current environment state, the goal output state, and the previous steps (encoded with an LSTM), and predicts what the next step should be (e.g. Sort: 78%, Reverse: 6%, Mul by 2: 4%, …).

Dataset Pruning: training programs are replaced by shortened programs. [Figure: an original Karel program and its shortened counterpart; the original contains redundant operations such as "Repeat (R=4) { turnLeft() }" and "turnRight()" immediately followed by "turnLeft()".]

114 of 123

DSL Benchmark

115 of 123

MuleX: Disentangling Exploration and Exploitation in Deep Reinforcement Learning

Lucas Beyer¹, Damien Vincent¹, Olivier Teboul², Matthieu Geist², Olivier Pietquin²

Google Brain (¹Zürich, ²Paris)

116 of 123

MuleX: Motivations

Most common way of doing exploration:

Q ← Rtask + β1Rbonus1 + β2Rbonus2 + …

Works but has unfortunate consequences:

  • Changes the MDP we are solving!
  • Adds non-stationarity.
  • Will the final agent ever forget exploring?
  • Most likely trains longer than necessary.

117 of 123

MuleX: Approach

To explore or to exploit? … Why not both!?

Qtask ← Rtask    Qbonus1 ← Rbonus1    Qbonus2 ← Rbonus2

And learn from shared experience.
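A minimal sketch of the decoupling under stated assumptions (one Q-head per reward stream, a simple behaviour scheduler, and abstract head/batch interfaces); the actual architecture and scheduler are those described in the paper.

import random

def train_step(heads, batch, gamma=0.99):
    """Each head k learns its own Q_k from its own reward stream R_k,
    but all heads are trained from the same shared experience."""
    for name, head in heads.items():
        # batch.rewards is assumed to hold one reward array per stream: 'task', 'bonus1', ...
        target = batch.rewards[name] + gamma * (1 - batch.dones) * head.max_q(batch.next_obs)
        head.update(batch.obs, batch.actions, target)

def act(heads, obs, p_explore=0.3):
    """Behaviour policy: sometimes act greedily w.r.t. an exploration head,
    otherwise w.r.t. the task head -- never w.r.t. a mixed reward."""
    name = "bonus1" if random.random() < p_explore else "task"
    return heads[name].greedy_action(obs)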

118 of 123

MuleX: Evaluation

119 of 123

Poster Session #2

15:00 - 16:00

120 of 123

Invited Speaker: Martha White

16:00 - 16:30

121 of 123

Panel Discussion

16:30 - 17:30

122 of 123

Panelists

Pieter Abbeel

Doina Precup

Martha White

Jeff Clune

Pulkit Agrawal

123 of 123

Thanks!

Speakers and Panelists: Martha White, Emo Todorov, Raia Hadsell, Doina Precup, Pieter Abbeel, Jeff Clune, Pulkit Agrawal

Organizers: Ben Eysenbach, Surya Bhupatiraju, Shane Gu, Harri Edwards, Martha White, Pierre-Yves Oudeyer, Emma Brunskill, Kenneth Stanley

Website: Papers, slides, and recorded talks: https://sites.google.com/view/erl-2019/

Sponsors: