1 of 123

Exploration in RL

9:00 - 17:30

Audience: Please sit in front of the center-front screen.

Authors: Please use any poster board on this side

2 of 123

Overview

9:00 - 9:30: Keynote: Doina Precup

9:30 - 10:00: Spotlights

10:00 - 11:00: Poster Session #1

11:00 - 11:30: Speaker: Emo Todorov

11:30 - 12:00: Best Paper Awards

12:00 - 12:30: Speaker: Pieter Abbeel

12:30 - 14:00: Lunch

14:00 - 14:30: Speaker: Raia Hadsell

14:30 - 15:00: Lightning Talks

15:00 - 16:00: Poster Session #2

16:00 - 16:30: Speaker: Martha White

16:30 - 17:30: Panel Discussion

3 of 123

Keynote: Doina Precup

9:00 - 9:30

4 of 123

Spotlights

(5 x 5 min)

9:30 - 10:00

5 of 123

Overcoming Exploration With Play

Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet

6 of 123

Problem:

One robot, many tasks

7 of 123

“Playing” data for training, collected from human teleoperation (video shown at 2.5× speed).

8 of 123

Training Play-LMP

  1. Given unlabeled play data: a replay buffer of unlabeled & unsegmented play videos & actions.
  2. Learn latent plans using self-supervision: a plan proposal network (conditioned on the current and goal states) and a plan recognition network (conditioned on the entire sequence) are matched in the latent plan distribution space via KL-divergence minimization.
  3. Decode the (sampled) latent plan to reconstruct actions: an action decoder, conditioned on the current state, the goal, and the latent plan, maximizes the likelihood of the recorded actions.

[Figure: Play-LMP training diagram.]

9 of 123

18 tasks (for evaluation only)

10 of 123

Examples of successful runs for Play-LMP

Goal

Play-LMP policy

1x

(task: sliding)

11 of 123

Examples of successful runs for Play-LMP

Goal

Play-LMP policy

1x

(task: sweep)

12 of 123

Composing 2 skills: grasp flat + drop in trash

Goals

Play-LMP policy

1x

=

+

13 of 123

8 skills in a row

Goal

Play-LMP policy

1.5x

14 of 123

  • Paper + videos: Learning-from-play.github.io

Thank you!

15 of 123

Optimistic Exploration with Pessimistic Initialisation

Tabish Rashid, Bei Peng, Wendelin Boehmer, Shimon Whiteson

16 of 123

Motivations

  • Optimistic Initialisation is an effective strategy for exploration in tabular RL.
  • Popular model-free Deep RL algorithms take inspiration from the tabular setting.
  • But they DO NOT attempt optimistic initialisation,
  • even though ALL provably efficient model-free algorithms rely on it.

17 of 123

Optimistic Init with Neural Networks

  • Why can’t we do optimistic initialisation with neural networks?
  • For an optimistic initialisation to benefit exploration, the Q-Values for unseen state-action pairs must start high and remain high until they are visited.

18 of 123

Is this bad?

  • Assume a pessimistic initialisation (for a worst-case outlook).
  • For non-negative rewards this is 0.
  • Without optimism we can fail on this simple 1-state, 2-action MDP!

19 of 123

Separating Optimism from Q-Value approximation

  • If we can’t ensure sufficient optimism from our function approximator, let's have a separate source of optimism.
  • Use Q+ during action selection and bootstrapping.

[Figure: Q+ combines the function approximator with a count-based source of optimism; a sketch follows below.]
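A sketch of the augmented Q-values in the deck's notation; the exact bonus form and constants below are an illustrative assumption, not necessarily the paper's precise definition:

Q+(s, a) = Q(s, a) + C / (N(s, a) + 1)^M

Here Q is the (pessimistically initialised) function approximator, N(s, a) is a visit count, and C, M > 0 set the scale and decay of the count-based optimism. Q+ is used both for action selection (argmax over a of Q+(s, a)) and for the bootstrap target (max over a' of Q+(s', a')), so unvisited actions look attractive even though Q itself starts at 0.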

20 of 123

Tabular Regret Bounds

  • Starting from UCB-H [1] in the finite-horizon setting.
  • Pessimistically initialise the Q-Values at 0 (instead of at H).
  • Use Q+ for action selection and bootstrapping.
  • OPIQ: Optimistic Pessimistically Initialised Q-Learning.
  • We can achieve the same regret bounds as UCB-H for M ≥ 1.

[1] Jin, Chi, et al. "Is Q-learning provably efficient?" Advances in Neural Information Processing Systems. 2018.

21 of 123

Scaling to Deep RL

  • OPIQ does not assume an optimistic initialisation.
  • When extending it to Deep RL, we lose fewer crucial parts of the algorithm.
  • Base it upon DQN with pseudo-count based intrinsic motivation.
  • To approximate the counts we use static hashing [2] for its generality and simplicity (see the sketch below).
    • Better results can be achieved with better approximate counting schemes.
  • Use Q+ for action selection and bootstrapping.

[2] Tang, Haoran, et al. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems. 2017.
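A minimal sketch of SimHash-style static hashing counts in the spirit of [2]; the feature dimension, code length k, and the exact way the count feeds into the bonus are illustrative assumptions.

import numpy as np

class HashCounter:
    """Approximate visit counts via a fixed random projection (static hashing)."""

    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed projection matrix
        self.counts = {}

    def _code(self, state):
        # k-bit binary code: sign pattern of the projected state features.
        return tuple((self.A @ np.asarray(state, dtype=np.float64) > 0).astype(np.int8))

    def update(self, state):
        code = self._code(state)
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.counts[code]

    def count(self, state):
        return self.counts.get(self._code(state), 0)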

22 of 123

Maze Results

23 of 123

Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration

Jingwei Zhang*, Niklas Wetzel*, Nicolai Dorka*, Joschka Boedecker, Wolfram Burgard

24 of 123

Sparse Reward Reinforcement Learning

  • Needs less reward shaping, which:
    • Can induce unintended behaviour
    • Is difficult to define for many real-world tasks (e.g. in robotics)
    • Needs a lot of supervision (e.g. setting up motion capture)
  • Less informative feedback
    • Structured exploration is needed

25 of 123

Intrinsic Rewards

  • The agent generates its own intrinsic rewards in order to explore its environment in a structured way
  • Can solve tasks that random exploration has a near-zero chance of solving

26 of 123

Limitation of Current Intrinsic Reward Formulations

  • No temporally extended exploration: most approaches use only local information (e.g. one-step prediction error)
  • Mixture policy: the intrinsic reward is added as a reward bonus, which leads to a mixture policy that acts greedily neither with respect to exploration nor with respect to extrinsic reward maximization

27 of 123

Successor Feature Control

  • Idea: Use successor features to take temporally extended information into account
  • Successor Features (standard definition recalled below):

φ: some feature embedding of the state

  • Intrinsic Reward: derived from the successor features (formula in the paper)
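For reference, the standard successor-feature definition this builds on (the paper's exact intrinsic-reward formula is not reproduced here):

ψ^π(s) = E_π[ Σ_{t≥0} γ^t φ(s_t) | s_0 = s ]

Because ψ^π aggregates discounted future features rather than a single next observation, an intrinsic reward derived from it (for example, one based on how much the successor features change along the trajectory) carries temporally extended information instead of one-step novelty.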

28 of 123

Scheduled Intrinsic Drive

  • Idea: Decouple exploration from extrinsic reward maximization
  • Hierarchical approach
  • Learn multiple policies with different reward functions
    • One policy maximizes extrinsic reward
    • One policy maximizes intrinsic reward
  • Several times during each episode, schedule which of the policies to follow
  • Train both strategies off-policy with all experiences (see the sketch below)
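A minimal sketch of the scheduling loop under stated assumptions (a fixed number of equally long segments, a uniform scheduler, and abstract policy/buffer interfaces); the paper's actual scheduler and learners may differ.

import random

def run_episode(env, task_policy, explore_policy, replay_buffer,
                num_segments=4, max_steps=400):
    """Switch between the extrinsic and intrinsic policies several times per episode.
    Every transition is stored once and reused to train both policies off-policy."""
    obs = env.reset()
    steps_per_segment = max_steps // num_segments
    for _ in range(num_segments):
        active = random.choice([task_policy, explore_policy])  # scheduling decision
        for _ in range(steps_per_segment):
            action = active.act(obs)
            next_obs, extrinsic_reward, done, info = env.step(action)
            # Shared experience: intrinsic rewards can be (re)computed at training time.
            replay_buffer.add(obs, action, extrinsic_reward, next_obs, done)
            obs = next_obs
            if done:
                return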

29 of 123

Experiments: Doom

FlytrapEscape

MyWayHome

30 of 123

Results

[Figure: results on MyWayHome and FlytrapEscape.]

31 of 123

Thank you for your attention!

Link to updated version of the paper: https://arxiv.org/abs/1903.07400

More experiments at the poster

32 of 123

Generative Exploration and Exploitation

Jiechuan Jiang, Zongqing Lu

33 of 123

The Journey is the Reward: Unsupervised Learning of Influential Trajectories

Jonathan Binas, Sherjil Ozair, Yoshua Bengio

34 of 123

Rather than just discovering new outcomes, learn how to achieve them.

Empowerment: learn controllability by maximizing I(a1, a2, ...; oT).

Instead of considering a single step or a single target observation, consider trajectories: I(a1, a2, …; f(o1, o2, …))
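The trajectory-level objective can be read through the standard mutual-information decomposition (this restates the slide's quantity, not the paper's particular estimator):

I(a1, a2, …; f(o1, o2, …)) = H(f(o1, o2, …)) − H(f(o1, o2, …) | a1, a2, …)

Maximizing it favours action sequences whose effect on the trajectory summary f is both diverse (high entropy) and predictable given the actions (low conditional entropy), i.e. influential trajectories rather than merely novel single observations.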

37 of 123

Partial observability and high-dimensional actions

38 of 123

The Journey is the Reward: Unsupervised Learning of Influential Trajectories

Jonathan Binas, Sherjil Ozair, Yoshua Bengio

39 of 123

Poster Session #1

10:00 - 11:00

40 of 123

Invited Speaker:

Emo Todorov

11:00 - 11:30

41 of 123

Best Paper Awards

11:30 - 12:00

42 of 123

Best Paper Awards

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment. Adrien Ali Taiga, Marc G. Bellemare, Aaron Courville, Liam Fedus, Marlos C. Machado

Simple Regret Minimization for Contextual Bandits. Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott

43 of 123

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment.

Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare

44 of 123

45 of 123

Arcade Learning Environment (ALE)

Bellemare et al., 2013

46 of 123

Deep Q-Networks

Hard exploration games

Mnih et al., 2015

47 of 123

Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)

48 of 123

Recent improvements

  • Counts (Bellemare et al. 2016, Ostrovski et al. 2017, Machado et al. 2018)
  • Parameter Noise (Fortunato et al., Plappert et al.)
  • Thompson Sampling (Osband et al. 2016, Osband et al. 2018)
  • Prediction error (Pathak et al. 2017, Burda et al. 2018)

… and many more!

49 of 123

VentureBeat, 2018

50 of 123

Look back at the progress made in exploration in the ALE.

51 of 123

Exploration method wish list

  • Sample efficient
    • Solves the environment quickly.
  • Robust
    • Should not require tuning for every new environment.
  • Scalable
    • The ALE is a building block towards harder and more complex problems.

52 of 123

Evaluating exploration methods

Comparisons done

  • With different learning algorithms
    • E.g. MC-DQN (value-based) vs. PPO (policy-based)
  • Using different amounts of data (e.g. 200M vs. 2B frames).
  • Evaluating only on hard exploration games.

Reported results are hard to interpret!

53 of 123

Proposal

  • Benchmark recent exploration methods.
  • We focus on bonus-based methods.

54 of 123

Evaluation methodology

  • One agent: Rainbow (Hessel et al. 2017).
  • Hyperparameters of the learning algorithm are kept fixed.
  • Each agent is trained for 200M frames.
  • Agents are trained with sticky actions (Machado et al. 2017).
  • Exploration methods are tuned on Montezuma’s Revenge.

55 of 123

Methods studied

Bonus-based:

  • Pseudo-counts
    • CTS (Bellemare et al. 2016), PixelCNN (Ostrovski et al. 2017)
  • Intrinsic Curiosity Module (ICM; Pathak et al. 2017).
  • Random Network Distillation (RND; Burda et al. 2018).

Baselines:

  • ε-greedy exploration.
  • Noisy Networks (NoisyNets; Fortunato et al. 2017).

56 of 123

Empirical results

57 of 123

58 of 123

Evaluation

59 of 123


60 of 123

Evaluation

61 of 123

Easy exploration games

Atari training set:

Asterix - Seaquest - H.E.R.O - Space Invaders - Beam Rider - Freeway

62 of 123

63 of 123

Takeaways

64 of 123

65 of 123

Joint work with

66 of 123

Thank you for your attention!

67 of 123

Simple Regret Minimization for Contextual Bandits

Aniket Anand Deshmukh, Srinagesh Sharma, James Cutler, Mark Moldwin, Clayton Scott

68 of 123

Invited Speaker:

Pieter Abbeel

12:00 - 12:30

69 of 123

Lunch

12:30 - 14:00

70 of 123

Invited Speaker: Raia Hadsell

14:00 - 14:30

71 of 123

Lightning Talks

14:30 - 15:00

72 of 123

Format

12 speakers, 2 min each (no questions)

Please advance to the next speaker’s slide.

Please hold applause until end.

73 of 123

Curious iLQR: Resolving Uncertainty in Model-based RL

Sarah Bechtle¹, Akshara Rai², Yixin Lin², Ludovic Righetti¹,³, Franziska Meier²

¹Max Planck Institute for Intelligent Systems

²Facebook AI Research

³New York University

74 of 123

Motivation:

Exploration in Model-based RL might help escape model bias.

Approach:

  • Learn a dynamics model that incorporates uncertainty
  • Include the model uncertainty in the optimization of the trajectory

→ Seeking out actions that resolve uncertainty in the model leads to better performance and better model quality while staying close to the task being solved.

Evaluation:

7-DoF arm in simulation and on hardware, for a free-movement reaching task.

75 of 123

An Empirical and Conceptual Categorization of Value-based Exploration Methods

Niko Yasui¹, Sungsu Lim¹, Cameron Linke¹, Adam White¹,², Martha White¹

¹University of Alberta

²DeepMind

DQN:

  • Target network
  • Experience replay
  • Epsilon-greedy

etc...

76 of 123

Conceptual Properties

[Figure: two conceptual approaches relative to Q*: (approximate) upper confidence bounds and Thompson sampling.]

77 of 123

Simplifying Exploration

  • Linear Q-learning agent with fixed features

  • Environments
    • Misleading reward
    • High-variance reward
    • Large state-action space
    • Repulsive transition dynamics

78 of 123

Some takeaways

  • ε-greedy seems to do well when
    • Exploration is safe
    • Rewards are placed along a path near the optimal policy

  • Both optimism and posterior sampling appear to exhaustively sweep large state-action spaces
  • Need to understand interactions between representation learning and exploration

79 of 123

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Vitchyr H. Pong*, Murtaza Dalal*, Steven Lin*, Ashvin V. Nair, Shikhar Bahl, Sergey Levine (UC Berkeley)

80 of 123

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Motivation: Formal objective for autonomous exploration and controllability

Idea: Want agent to set diverse goals for itself

Method: Learn a maximum-entropy goal distribution even in unknown, high-dimensional state spaces (see the sketch below)

Evaluation: Image-based robot manipulation tasks

code: https://github.com/vitchyr/rlkit/
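A minimal sketch of one way to skew goal sampling toward higher entropy, in the spirit of the slide; the exponent alpha, the density model, and the resampling scheme are assumptions here rather than the authors' exact procedure.

import numpy as np

def skewed_goal_sample(states, log_prob_fn, alpha=-0.5, n_goals=64, rng=None):
    """Resample visited states as goals with weights proportional to p(s)**alpha.

    With alpha < 0, rare states are over-sampled, pushing the effective goal
    distribution toward higher entropy (broader coverage of the state space).
    """
    rng = np.random.default_rng() if rng is None else rng
    log_p = np.array([log_prob_fn(s) for s in states])  # density under e.g. a VAE
    log_w = alpha * log_p
    weights = np.exp(log_w - log_w.max())               # numerically stable normalisation
    weights /= weights.sum()
    idx = rng.choice(len(states), size=n_goals, p=weights)
    return [states[i] for i in idx]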

81 of 123

Optimistic Proximal Policy Optimization

Takahisa Imagawa (AIST)*

Takuya Hiraoka (NEC)

Yoshimasa Tsuruoka (The University of Tokyo)

82 of 123

Optimistic PPO: Motivations

Uncertainty Bellman exploration (UBE) [1] is one of the methods to alleviate the sparse-reward problem:

  • it evaluates the policy optimistically, by the amount of uncertainty in the evaluation
  • it has a more solid theoretical background than simply adding an intrinsic reward, as in random network distillation (RND) [2]
  • it was applied to SARSA; however, there are more sophisticated algorithms, e.g. proximal policy optimization (PPO) [3]

[1] Brendan O’Donoghue et al. The uncertainty Bellman equation and exploration.
[2] Yuri Burda et al. Exploration by random network distillation.
[3] John Schulman et al. Proximal policy optimization algorithms.

Thus, we apply UBE to PPO and propose a new algorithm, Optimistic PPO.
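For reference, the uncertainty Bellman equation of [1], roughly (the exact local-uncertainty term and constants are in the paper):

u(s, a) = ν(s, a) + γ² Σ_{s', a'} P(s' | s, a) π(a' | s') u(s', a')

where ν(s, a) is a local uncertainty term. Solving this gives per state-action uncertainty estimates u, and the policy is then evaluated optimistically in proportion to √u(s, a), instead of adding an intrinsic reward to the environment reward.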

83 of 123

Optimistic PPO: Experimental Results

[Figure: learning curves; the proposed method reaches a high return.]

84 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Partially observable predator-prey task
  • Independent Q-learning is slow

85 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Partially observable predator-prey task
  • Independent Q-learning is slow
  • Intrinsic reward introduces instability

86 of 123

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Wendelin Boehmer

Tabish Rashid

Shimon Whiteson

  • Introduce an additional Exploration Agent
  • Training: same instability
  • Deployed: fast and stable

87 of 123

Parameterized Exploration

Jesse Clifton, Lili Wu, Eric Laber (North Carolina State University)

  • Problem: Exploration over a finite horizon.
  • Intuitively, the agent should explore less and less as it approaches the end of the horizon.
  • Idea: Parameterize the exploration rate as a decreasing function of time t, and estimate its parameters using model-based rollouts.

88 of 123

  • Our exploration rate function

  • At each time t, estimate its parameters by solving

  • At time t, we can form an estimated environment.
  • In this estimated environment, different parameters give different exploration-rate sequences; we then choose the best one (see the sketch below).
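A minimal sketch under illustrative assumptions (a two-parameter decaying ε schedule and a generic rollout simulator of the estimated environment); the paper's actual functional form and tuning objective may differ.

import numpy as np

def epsilon(t, theta):
    """Exploration rate as a decreasing function of time, e.g. eps_t = theta0 / (1 + theta1 * t)."""
    theta0, theta1 = theta
    return theta0 / (1.0 + theta1 * t)

def choose_schedule(candidate_thetas, simulate_return, horizon_left, n_rollouts=20):
    """Pick the parameters whose exploration-rate sequence yields the highest
    average return in rollouts of the estimated environment."""
    best_theta, best_value = None, -np.inf
    for theta in candidate_thetas:
        eps_seq = [epsilon(t, theta) for t in range(horizon_left)]
        value = np.mean([simulate_return(eps_seq) for _ in range(n_rollouts)])
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta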

89 of 123

Evidence (Normal MAB)

More evidence for Bernoulli MABs, contextual bandits, and MDPs in the paper ...

90 of 123

Efficient Exploration in Side-Scrolling Video Games with Trajectory Replay

I-Huan Chiang*, Chung-Ming Huang,

Nien-Hu Cheng, Hsin-Yu Liu, Shi-Chun Tsai

(National Chiao Tung University,

Hsinchu, Taiwan)

91 of 123

Side-Scrolling Game

92 of 123

Scenarios in Side-Scrolling Games

Spring system

Moving platforms

Loop

Seesaw

93 of 123

[Figure: game level with Start and Goal positions.]

94 of 123


95 of 123

Penalty

96 of 123

TRM & TOM

[Figure: penalty scale, from high to low.]

97 of 123

Idea

98 of 123

Results

99 of 123

Thank you for your attention

thumbg12856.cs06g@g2.nctu.edu.tw

100 of 123

Hypothesis Driven Exploration for Deep Reinforcement Learning

Caleb Chuck (UT Austin)*; Supawit Chockchowwat (UT Austin);

Scott Niekum (UT Austin)

Explore by proposing and evaluating hypotheses about controllable object interactions, starting from raw pixels and raw actions

101 of 123

102 of 123

Learning latent state representation for speeding up exploration

Giulia Vezzani (Istituto Italiano di Tecnologia)*; Abhishek Gupta (UC Berkeley); Lorenzo Natale (Istituto Italiano di Tecnologia); Pieter Abbeel (UC Berkeley)

  • Representation learning applied to exploration
  • Prior experience is used to learn effective state representations
  • Entropy-based exploration method
  • The learned representation helps narrow the search space.

103 of 123

Learning latent state representation for speeding up exploration

Maximum-entropy bonus for exploration

Multi-headed reward regression

104 of 123

Epistemic Risk-Sensitive Reinforcement Learning

Hannes Eriksson (Chalmers)*; Christos Dimitrakakis (Chalmers University of Technology)

105 of 123

Utility functions

Epistemic risk

Risk that arises due to uncertainty about the model

Objective

106 of 123

Key results

  • We propose a novel framework for dealing with epistemic risk, with a tuneable risk parameter

  • We introduce three ways of acting under epistemic uncertainty (the standard risk notions involved are recalled below):
    • Extended Value Iteration under epistemic risk
    • Epistemic Risk-Sensitive policy gradient
    • Bayesian Epistemic CVaR
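For reference, the standard risk notions these build on (definitions only; the paper's exact epistemic formulations are taken with respect to the posterior over models):

Exponential utility: U_β(X) = −(1/β) · log E[exp(−β·X)], where β > 0 is the tuneable risk parameter and β → 0 recovers the risk-neutral E[X].

CVaR at level α: CVaR_α(X) = E[X | X ≤ VaR_α(X)], the expected outcome over the worst α-fraction of cases.

In the epistemic setting, these expectations range over the posterior distribution of MDP models rather than over within-model randomness.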

107 of 123

Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

Aristide Tossou*; Debabrota Basu; Christos Dimitrakakis

(Chalmers University of Technology)

108 of 123

UCRL-V vs. state-of-the-art UCRL2

UCRL2 (Jaksch et al., 2010) regret:

Lower bound (Jaksch et al., 2010) :

UCRL-V:

Same time complexity and same settings

109 of 123

UCRL-V key ingredients

Variance-based confidence intervals for all subsets of next states (see the note below)

A new episode is started when the average number of states doubled reaches 1
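For context, a note on the empirical Bernstein inequality such variance-based intervals typically build on (stated up to constants): for n i.i.d. samples in [0, 1] with empirical mean X̄ and empirical variance V̂, with probability at least 1 − δ,

|X̄ − E[X]| ≲ √(V̂ · log(1/δ) / n) + log(1/δ) / n

The variance-dependent first term lets the confidence intervals shrink much faster than Hoeffding-style bounds when the observed outcomes have low variance.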

110 of 123

Improved Tree Search for Automatic Program Synthesis

Aran Carmon1, Lior Wolf1,2

1Tel Aviv University

2Facebook AI Research (FAIR)

Using a variant of MCTS that leads to state-of-the-art program synthesis results on two vastly different DSLs.

Multiple contributions:

  1. a modified visit count with shared states
  2. encoding the part of the program that was already executed
  3. a preprocessing procedure for the training dataset

111 of 123

Program Synthesis

Inputs → ?Program? → Outputs

Deepcoder example:
  Input:   [1, -7, -14, 12, 8, 18, -11]
  Program: Filter(<0) → Map(*2) → Sort() → Reverse()
  Output:  [-14, -22, -28]
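A quick sanity check of the Deepcoder example above, using plain Python list operations as stand-ins for the DSL primitives:

xs = [1, -7, -14, 12, 8, 18, -11]
step1 = [x for x in xs if x < 0]   # Filter(<0)  -> [-7, -14, -11]
step2 = [2 * x for x in step1]     # Map(*2)     -> [-14, -28, -22]
step3 = sorted(step2)              # Sort()      -> [-28, -22, -14]
step4 = list(reversed(step3))      # Reverse()   -> [-14, -22, -28]
assert step4 == [-14, -22, -28]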

Karel example:
  Program:
    While (rightIsNotClear) {
      If (markerPresent) {
        pickMarker()
      }
      move()
    }
    putMarker()
    move()
  Input:
    ⟴ ⚐
    ⛝⛝⛝⛝⛝⛝
  Output:
    ⚐⟴
    ⛝⛝⛝⛝⛝⛝

112 of 123

Using the Policy Network with MCTS

Selection score: Score = P / (N + 1), where P is the probability assigned by the policy network and N is the visit count.

[Figure: search tree rooted at Init; the policy network assigns prior probabilities to the children (e.g. 55%, 40%, 2%). A child that has been explored too much (high visit count) or that has a low network probability is scored down, and the highest-scoring child is the best candidate this time.]
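A minimal sketch of this selection rule with visit counts shared across identical states; the node fields and the state-key function are assumptions.

def select_child(children, visit_counts, state_key):
    """Pick the child maximizing Score = P / (N + 1), where P is the policy
    network's prior probability and N is the shared visit count of the child's state."""
    def score(child):
        n = visit_counts.get(state_key(child.state), 0)
        return child.prior / (n + 1)
    best = max(children, key=score)
    key = state_key(best.state)
    visit_counts[key] = visit_counts.get(key, 0) + 1  # shared count, not per-tree-node
    return best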

113 of 123

Partial Encoding: the policy network receives the current environment state, the goal output state, and the previous steps (encoded with an LSTM), and predicts what the next step should be (e.g. Sort: 78%, Reverse: 6%, Mul by 2: 4%, …).

Dataset Pruning: training programs are replaced by shortened programs. [Figure: an original Karel program and its shortened counterpart; the original contains redundant operations such as "Repeat (R=4) { turnLeft() }" and "turnRight()" immediately followed by "turnLeft()".]

114 of 123

DSL Benchmark

115 of 123

MuleX: Disentangling Exploration and Exploitation in Deep Reinforcement Learning

Lucas Beyer¹, Damien Vincent¹, Olivier Teboul², Matthieu Geist², Olivier Pietquin²

Google Brain (¹Zürich, ²Paris)

116 of 123

MuleX: Motivations

Most common way of doing exploration:

Q ← Rtask + β1Rbonus1 + β2Rbonus2 + …

Works but has unfortunate consequences:

  • Changes the MDP we are solving!
  • Adds non-stationarity.
  • Will the final agent ever forget exploring?
  • Most likely trains longer than necessary.

117 of 123

MuleX: Approach

To explore or to exploit? … Why not both!?

Qtask ← Rtask    Qbonus1 ← Rbonus1    Qbonus2 ← Rbonus2

And learn from shared experience.
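A minimal sketch of the decoupling under stated assumptions (one Q-head per reward stream, a simple behaviour scheduler, and abstract head/batch interfaces); the actual architecture and scheduler are those described in the paper.

import random

def train_step(heads, batch, gamma=0.99):
    """Each head k learns its own Q_k from its own reward stream R_k,
    but all heads are trained from the same shared experience."""
    for name, head in heads.items():
        # batch.rewards is assumed to hold one reward array per stream: 'task', 'bonus1', ...
        target = batch.rewards[name] + gamma * (1 - batch.dones) * head.max_q(batch.next_obs)
        head.update(batch.obs, batch.actions, target)

def act(heads, obs, p_explore=0.3):
    """Behaviour policy: sometimes act greedily w.r.t. an exploration head,
    otherwise w.r.t. the task head -- never w.r.t. a mixed reward."""
    name = "bonus1" if random.random() < p_explore else "task"
    return heads[name].greedy_action(obs)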

118 of 123

MuleX: Evaluation

119 of 123

Poster Session #2

15:00 - 16:00

120 of 123

Invited Speaker: Martha White

16:00 - 16:30

121 of 123

Panel Discussion

16:30 - 17:30

122 of 123

Panelists

Pieter Abbeel

Doina Precup

Martha White

Jeff Clune

Pulkit Agrawal

123 of 123

Thanks!

Speakers and Panelists: Martha White, Emo Todorov, Raia Hadsell, Doina Precup, Pieter Abbeel, Jeff Clune, Pulkit Agrawal

Organizers: Ben Eysenbach, Surya Bhupatiraju, Shane Gu, Harri Edwards, Martha White, Pierre-Yves Oudeyer, Emma Brunskill, Kenneth Stanley

Website: Papers, slides, and recorded talks: https://sites.google.com/view/erl-2019/

Sponsors: