Exploration in RL
9:00 - 17:30
Audience: Please sit in front of the center-front screen.
Authors: Please use any poster board on this side
Overview
9:00 - 9:30: Keynote: Doina Precup
9:30 - 10:00: Spotlights
10:00 - 11:00: Poster Session #1
11:00 - 11:30: Speaker: Emo Todorov
11:30 - 12:00: Best Paper Awards
12:00 - 12:30: Speaker: Pieter Abbeel
12:30 - 14:00: Lunch
14:00 - 14:30: Speaker: Raia Hadsell
14:30 - 15:00: Lightning Talks
15:00 - 16:00: Poster Session #2
16:00 - 16:30: Speaker: Martha White
16:30 - 17:30: Panel Discussion
Keynote: Doina Precup
9:00 - 9:30
Spotlights
(5 x 5 min)
9:30 - 10:00
Overcoming Exploration With Play
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet
Problem:
One robot, many tasks
“Play” data for training, collected from human teleoperation
(2.5x speedup)
(Figure: the current state, goal, and the entire sequence are encoded into a latent plan distribution space by a plan proposal network and a plan recognition network, matched via KL-divergence minimization.)
2. Learn latent plans using self-supervision
(Figure: an action decoder takes the current state, goal, and a sampled latent plan and outputs an action likelihood for each action aₜ.)
3. Decode plan to reconstruct actions
Training Play-LMP
(Figure: action windows aₜ … aₜ₊ₘ are sampled from a replay buffer of unlabeled & unsegmented play videos & actions.)
18 tasks (for evaluation only)
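Below is a minimal, hedged sketch of the training step described above, written as a conditional-VAE-style loss: a plan recognition network sees the whole window, a plan proposal network sees only the current state and goal, the two are tied by a KL term, and an action decoder reconstructs the actions. The module names, the mean-pooled sequence encoding, and the squared-error decoder loss are illustrative simplifications; the actual Play-LMP model uses sequence encoders and a richer action decoder.

```python
import torch
import torch.nn as nn

class PlayLMPSketch(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden=256):
        super().__init__()
        # Plan recognition: encodes the entire play window (here mean-pooled).
        self.recognition = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        # Plan proposal: sees only the current state and the goal.
        self.proposal = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        # Action decoder: (state, goal, sampled latent plan) -> action.
        self.decoder = nn.Sequential(
            nn.Linear(2 * obs_dim + latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

    def loss(self, seq_obs, actions, beta=0.01):
        # seq_obs: (T, obs_dim) window from the play replay buffer; actions: (T, act_dim).
        # First frame is treated as the current state, last frame as the goal.
        current, goal = seq_obs[0], seq_obs[-1]
        mu_q, logvar_q = self.recognition(seq_obs.mean(dim=0)).chunk(2)
        mu_p, logvar_p = self.proposal(torch.cat([current, goal])).chunk(2)
        # Reparameterized sample from the recognition distribution.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        # Reconstruct every action in the window from (state, goal, plan).
        inp = torch.cat([seq_obs, goal.expand_as(seq_obs),
                         z.expand(seq_obs.shape[0], -1)], dim=-1)
        recon = ((self.decoder(inp) - actions) ** 2).mean()
        # KL(recognition || proposal), both diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum()
        return recon + beta * kl
```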
Examples of success runs for Play-LMP (task: sliding): goal image vs. Play-LMP policy, video at 1x.
Examples of success runs for Play-LMP (task: sweep): goal image vs. Play-LMP policy, video at 1x.
Composing 2 skills: grasp flat + drop in trash (goal images vs. Play-LMP policy, video at 1x).
8 skills in a row (goal image vs. Play-LMP policy, video at 1.5x).
Thank you!
Optimistic Exploration with Pessimistic Initialisation
Tabish Rashid, Bei Peng, Wendelin Boehmer, Shimon Whiteson
Motivations
Optimistic Init with Neural Networks
Is this bad?
Separating Optimism from Q-Value approximation
Function approximator
Count-based source of optimism
Tabular Regret Bounds
[1] Jin, Chi, et al. "Is Q-learning provably efficient?" Advances in Neural Information Processing Systems, 2018.
Scaling to Deep RL
[2] Tang, Haoran, et al. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems, 2017.
Maze Results
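A hedged tabular sketch of the idea above: keep the learned Q-values pessimistically initialised at zero, and add a separate count-based optimism term only when selecting actions and bootstrapping. The constants C and M and the gym-style environment interface are illustrative; the deep-RL version described in the talk replaces exact counts with pseudo-counts [2].

```python
import numpy as np

def optimism_separated_q_learning(env, n_states, n_actions, episodes=500,
                                  alpha=0.1, gamma=0.99, C=1.0, M=0.5):
    Q = np.zeros((n_states, n_actions))      # pessimistic initialisation
    N = np.zeros((n_states, n_actions))      # visit counts: the source of optimism

    def optimistic(s):
        # Optimism is added on top of Q, not baked into its initialisation.
        return Q[s] + C / (N[s] + 1.0) ** M

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(optimistic(s)))        # act greedily w.r.t. optimistic values
            s2, r, done, _ = env.step(a)
            N[s, a] += 1
            # Bootstrap from the optimistic value of the next state as well.
            target = r + (0.0 if done else gamma * np.max(optimistic(s2)))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q, N
```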
Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration
Jingwei Zhang*, Niklas Wetzel*, Nicolai Dorka*, Joschka Boedecker, Wolfram Burgard
Sparse Reward Reinforcement Learning
Intrinsic Rewards
Limitation of Current Intrinsic Reward Formulations
Successor Feature Control
φ(s): some feature embedding of the state
Scheduled Intrinsic Drive
Experiments: Doom
FlytrapEscape
MyWayHome
Results
MyWayHome
FlytrapEscape
Thank you for your attention!
Link to updated version of the paper: https://arxiv.org/abs/1903.07400
More experiments at the poster
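A rough sketch of the scheduling idea, under assumptions: a high-level scheduler periodically picks which drive the agent follows (the extrinsic task reward or an intrinsic drive such as successor feature control), and every transition feeds one shared off-policy buffer. The uniform scheduler, macro-step length, and interfaces below are illustrative, not the paper's exact design.

```python
import random

def run_scheduled_episode(env, policies, buffer, macro_len=64):
    # policies: dict mapping drive name -> policy callable, e.g. {'task': ..., 'sfc': ...}
    obs, done = env.reset(), False
    while not done:
        drive = random.choice(list(policies))      # simplest scheduler: uniform choice
        policy = policies[drive]
        for _ in range(macro_len):                 # follow the chosen drive for a macro-step
            action = policy(obs)
            next_obs, reward, done, info = env.step(action)
            buffer.add(obs, action, reward, next_obs, done)   # shared experience
            obs = next_obs
            if done:
                break
```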
Generative Exploration and Exploitation
Jiechuan Jiang, Zongqing Lu
The Journey is the Reward: Unsupervised Learning of Influential Trajectories
Jonathan Binas, Sherjil Ozair, Yoshua Bengio
Rather than just discovering new outcomes, learn how to achieve them.
Empowerment: learn controllability by maximizing I(a₁, a₂, …; o_T).
Instead of considering a single step or a single target observation, consider trajectories: I(a₁, a₂, …; f(o₁, o₂, …))
Partial observability and high-dimensional actions
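For reference, one standard way to make a mutual-information objective of this form tractable is a variational lower bound in the style of Barber & Agakov; the paper's exact estimator may differ:

```latex
I\big(a_1,\dots,a_T;\, f(o_1,\dots,o_T)\big)
  \;\ge\; H(a_1,\dots,a_T)
  \;+\; \mathbb{E}\!\left[\log q_\phi\!\big(a_1,\dots,a_T \mid f(o_1,\dots,o_T)\big)\right]
```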
Poster Session #1
10:00 - 11:00
Invited Speaker:
Emo Todorov
11:00 - 11:30
Best Paper Awards
11:30 - 12:00
Best Paper Awards
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment. Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare
Simple Regret Minimization for Contextual Bandits. Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment.
Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare
Arcade Learning Environment (ALE)
Bellemare et al., 2013
Deep Q-Networks
Hard exploration games
Mnih et al., 2015
Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)
Recent improvements
… and many more!
VentureBeat, 2018
A look back at the progress made on exploration in the ALE.
Exploration method wish list
Evaluating exploration methods
Comparisons done
Reported results are hard to interpret!
Proposal
Evaluation methodology
Methods studied
Bonus-based:
Baselines:
Empirical results
Evaluation
Evaluation
Easy exploration games
Atari training set:
Asterix - Seaquest - H.E.R.O - Space Invaders - Beam Rider - Freeway
Takeaways
Joint work with
Thank you for your attention!
Simple Regret Minimization for Contextual Bandits
Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott
Invited Speaker:
Pieter Abbeel
12:00 - 12:30
Lunch
12:30 - 14:00
Invited Speaker: Raia Hadsell
14:00 - 14:30
Lightning Talks
14:30 - 15:00
Format
12 speakers, 2 min each (no questions)
Please advance to next speaker’s slide.
Please hold applause until end.
Curious iLQR: Resolving Uncertainty in Model-based RL
Sarah Bechtle¹, Akshara Rai², Yixin Lin², Ludovic Righetti¹,³, Franziska Meier²
¹Max Planck Institute for Intelligent Systems
²Facebook AI Research
³New York University
Motivation:
Exploration in Model-based RL might help escape model bias.
Approach:
→ Seeking out actions that resolve model uncertainty leads to better performance and better model quality, while staying close to the task being solved.
Evaluation:
A 7-DoF arm, in simulation and on hardware, on a free-movement reaching task.
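A tiny, hedged sketch of the stated idea: augment the iLQR running cost so that actions which resolve model uncertainty become cheaper. `task_cost` and `model_variance` are assumed callables, not the paper's API.

```python
def curious_running_cost(x, u, task_cost, model_variance, weight=1.0):
    # model_variance(x, u): predictive variance of the learned dynamics at (x, u).
    # Subtracting it rewards actions that visit uncertain regions of the model.
    return task_cost(x, u) - weight * model_variance(x, u)
```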
An Empirical and Conceptual Categorization of Value-based Exploration Methods
Niko Yasui¹, Sungsu Lim¹, Cameron Linke¹, Adam White¹,², Martha White¹
¹University of Alberta
²DeepMind
DQN:
etc...
Conceptual Properties
(Figure: (approximate) upper confidence bounds and Thompson sampling, illustrated relative to Q* and an estimate Q̃.)
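An illustrative, tabular contrast between the two families named above (hypothetical counts and posteriors; not the paper's implementation):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Approximate upper confidence bound: optimism grows with total time t
    # and shrinks with per-action visit counts N.
    return int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / (N + 1))))

def thompson_action(mu, sigma, rng=np.random):
    # Thompson sampling: draw one plausible value per action from the
    # (here Gaussian) posterior and act greedily on the sample.
    return int(np.argmax(rng.normal(mu, sigma)))
```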
Simplifying Exploration
Some takeaways
Skew-Fit: State-Covering Self-Supervised Reinforcement Learning
Vitchyr H. Pong*, Murtaza Dalal*, Steven Lin*, Ashvin V. Nair, Shikhar Bahl, Sergey Levine (UC Berkeley)
Skew-Fit: State-Covering Self-Supervised Reinforcement Learning
Motivation: Formal objective for autonomous exploration and controllability
Idea: Want agent to set diverse goals for itself
Method: Learn a maximum-entropy goal distribution, even in unknown, high-dimensional state spaces
Evaluation: Image-based robot manipulation tasks
code: https://github.com/vitchyr/rlkit/
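A minimal sketch of the skewed re-weighting that pushes the learned goal distribution toward maximum entropy, assuming a density model `q` with a `log_prob` method; α ∈ [-1, 0) upweights rarely seen states. Names and the surrounding resampling loop are illustrative.

```python
import numpy as np

def skewed_weights(states, q, alpha=-1.0):
    log_q = np.array([q.log_prob(s) for s in states])
    log_w = alpha * log_q          # w(s) ∝ q(s)^alpha: rare states get more weight
    log_w -= log_w.max()           # numerical stability before exponentiating
    w = np.exp(log_w)
    return w / w.sum()             # normalised sampling weights

# Usage: re-fit the goal-generative model on states resampled with these weights,
# then sample self-proposed goals from the re-fitted model.
```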
Optimistic Proximal Policy Optimization
Takahisa Imagawa (AIST)*
Takuya Hiraoka (NEC)
Yoshimasa Tsuruoka (The University of Tokyo)
Optimistic PPO: Motivations
The Uncertainty Bellman Equation (UBE) [1] is one method to alleviate the sparse-reward problem
[1] Brendan O'Donoghue et al. The uncertainty Bellman equation and exploration.
[2] Yuri Burda et al. Exploration by random network distillation.
[3] John Schulman et al. Proximal policy optimization algorithms.
Thus, we apply UBE to PPO and propose a new algorithm, Optimistic PPO
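For context, the form of the uncertainty Bellman equation from [1], stated here only for reference; how the resulting uncertainty estimate is combined with PPO's objective is detailed in the paper:

```latex
u(s,a) \;=\; \nu(s,a) \;+\; \gamma^{2}\,\mathbb{E}_{s',a'}\!\left[u(s',a')\right],
\qquad
\tilde{Q}(s,a) \;=\; Q(s,a) \;+\; \beta\sqrt{u(s,a)}
```

Here ν(s, a) is the local uncertainty and β a confidence parameter.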
Optimistic PPO: Experimental Results
(Plot: learning curves; the proposed method reaches a high return.)
Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning
Wendelin Boehmer, Tabish Rashid, Shimon Whiteson
Parameterized Exploration
Jesse Clifton, Lili Wu, Eric Laber (North Carolina State University)
Evidence (Normal MAB)
More evidence for Bernoulli-MAB, contextual bandit, MDP in the paper ...
Efficient Exploration in Side-Scrolling Video Games with Trajectory Replay
I-Huan Chiang*, Chung-Ming Huang, Nien-Hu Cheng, Hsin-Yu Liu, Shi-Chun Tsai
(National Chiao Tung University, Hsinchu, Taiwan)
Side-Scrolling Game
Scenarios in Side-Scrolling Games
Spring system
Moving platforms
Loop
Seesaw
(Figure: level map with Start, Goal, and Penalty regions.)
TRM & TOM
(Figure: penalty map, from low to high penalty.)
Results
Thank you for your attention
thumbg12856.cs06g@g2.nctu.edu.tw
Hypothesis Driven Exploration for Deep Reinforcement Learning
Caleb Chuck (UT Austin)*; Supawit Chockchowwat (UT Austin);
Scott Niekum (UT Austin)
Explore by proposing and evaluating hypotheses about controllable object interactions, starting from raw pixels and raw actions
Learning latent state representation for speeding up exploration
Giulia Vezzani (Istituto Italiano di Tecnologia)*; Abhishek Gupta (UC Berkeley); Lorenzo Natale (Istituto Italiano di Tecnologia); Pieter Abbeel (UC Berkeley)
Learning latent state representation for speeding up exploration
Maximum-entropy bonus for exploration
Multi-headed reward regression
Epistemic Risk-Sensitive Reinforcement Learning
Hannes Eriksson (Chalmers)*; Christos Dimitrakakis (Chalmers University of Technology)
Utility functions
Epistemic risk
Risk that arises due to uncertainty about the model
Objective
Key results
Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities
Aristide Tossou*; Debabrota Basu; Christos Dimitrakakis
(Chalmers University of Technology)
UCRL-V vs. the state-of-the-art UCRL2
UCRL2 (Jaksch et al., 2010) regret:
Lower bound (Jaksch et al., 2010):
UCRL-V:
Same time complexity and same settings
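For context, a hedged summary of the bounds being compared, up to logarithmic factors (D: diameter, S: number of states, A: number of actions, T: horizon); the exact constants are in the paper and in Jaksch et al. (2010). The "near-optimal" in the title refers to approaching the lower bound.

```latex
\text{UCRL2 regret: } \tilde{O}\!\big(D\,S\,\sqrt{A\,T}\big),
\qquad
\text{lower bound: } \Omega\!\big(\sqrt{D\,S\,A\,T}\big)
```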
UCRL-V key ingredients
Variance-based confidence intervals for all subsets of next states
A new episode starts when the average number of states whose counts have doubled reaches 1
Improved Tree Search for Automatic Program Synthesis
Aran Carmon¹, Lior Wolf¹,²
¹Tel Aviv University
²Facebook AI Research (FAIR)
Using a variant of MCTS that achieves state-of-the-art program synthesis results on two vastly different DSLs.
Multiple contributions:
Program Synthesis
(Figure: given input/output examples, find the unknown program.)
Example program: Filter(<0) → Map(*2) → Sort() → Reverse()
Input: [1, -7, -14, 12, 8, 18, -11]
Output: [-14, -22, -28]
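As a quick check, the example program applied to the example input, transcribed into plain Python purely for illustration:

```python
# Filter(<0) -> Map(*2) -> Sort() -> Reverse(), applied to the example input.
xs = [1, -7, -14, 12, 8, 18, -11]
ys = list(reversed(sorted(x * 2 for x in xs if x < 0)))
assert ys == [-14, -22, -28]
```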
DeepCoder
While (rightIsNotClear) {
    If (markerPresent) {
        pickMarker()
    }
    move()
}
putMarker()
move()
Input
⟴ ⚐
⛝⛝⛝⛝⛝⛝
Output
⚐⟴
⛝⛝⛝⛝⛝⛝
Karel
Using the Policy Network with MCTS
(Figure: search tree with per-node visit counts and network probabilities; annotations mark a child explored too much, a child assigned low probability by the network, and the best candidate this time.)
Score = P / (N + 1), where P is the probability assigned by the network and N is the visit count.
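A minimal sketch of that selection rule (the children data structure is illustrative, not the paper's code):

```python
def select_child(children):
    # children: list of dicts with keys 'prior' (P, network probability)
    # and 'visits' (N, visit count) -- hypothetical structure.
    return max(children, key=lambda c: c['prior'] / (c['visits'] + 1))
```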
Partial Encoding
Dataset Pruning
Policy Network
(Figure: an LSTM conditions on the current environment state, the goal output state, and the previous steps, and predicts what the next step should be, e.g. Sort 78%, Reverse 6%, Mul by 2 4%, …)
Original program | Shortened program
While (noMarkersPresent) { putMarker() } move() | putMarker() move()
turnLeft() move() putMarker() Repeat (R=4) { turnLeft() } turnRight() | turnLeft() move() putMarker() turnRight()
DSL Benchmark
MuleX: Disentangling Exploration and Exploitation in Deep Reinforcement Learning
Lucas Beyer¹, Damien Vincent¹, Olivier Teboul², Matthieu Geist², Olivier Pietquin²
Google Brain, ¹Zürich ²Paris
MuleX: Motivations
Most common way of doing exploration:
Q ↤ R_task + β₁ R_bonus1 + β₂ R_bonus2 + …
Works but has unfortunate consequences:
MuleX: Approach
To explore or to exploit? … Why not both!?
Q_task ↤ R_task,  Q_bonus1 ↤ R_bonus1,  Q_bonus2 ↤ R_bonus2,  …
And learn from shared experience.
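A rough sketch of the decomposition above, under assumptions: one Q-head per reward stream, all heads trained off-policy from the same replay batch. The module layout and batch format are illustrative, and how the behaviour policy switches between or mixes the heads at action time is the paper's contribution, not shown here.

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    def __init__(self, obs_dim, n_actions, n_rewards, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # One head per reward stream: Q_task, Q_bonus1, Q_bonus2, ...
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_rewards)])

    def forward(self, obs):
        h = self.trunk(obs)
        # Returns (n_rewards, batch, n_actions).
        return torch.stack([head(h) for head in self.heads], dim=0)

def td_loss(model, target_model, batch, gamma=0.99):
    # rewards has shape (n_rewards, batch): one reward stream per head,
    # all computed from the same shared transitions.
    obs, act, rewards, next_obs, done = batch
    q = model(obs)                                            # (R, B, A)
    with torch.no_grad():
        next_q = target_model(next_obs).max(dim=-1).values    # (R, B)
        targets = rewards + gamma * (1.0 - done) * next_q
    idx = act.expand(q.shape[0], -1).unsqueeze(-1)            # (R, B, 1)
    q_taken = q.gather(-1, idx).squeeze(-1)                   # (R, B)
    return ((q_taken - targets) ** 2).mean()                  # each head learns its own reward
```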
MuleX: Evaluation
Poster Session #2
15:00 - 16:00
Invited Speaker: Martha White
16:00 - 16:30
Panel Discussion
16:30 - 17:30
Panelists
Pieter Abbeel
Doina Precup
Martha White
Jeff Clune
Pulkit Agrawal
Thanks!
Speakers and Panelists: Martha White, Emo Todorov, Raia Hadsell, Doina Precup, Pieter Abbeel, Jeff Clune, Pulkit Agrawal
Organizers: Ben Eysenbach, Surya Bhupatiraju, Shane Gu, Harri Edwards, Martha White, Pierre-Yves Oudeyer, Emma Brunskill, Kenneth Stanley
Website: Papers, slides, and recorded talks: https://sites.google.com/view/erl-2019/
Sponsors: