1 of 22

Quiz 3 Review 1

2 of 22

Part 1 of Quiz 3 Review

Lectures #18-22

  • A little more MBRL
  • Ideas for Intelligent Exploration
  • Sim2Real Transfer

⇒ Part 2 will cover the rest

3 of 22

Some more MBRL

4 of 22

Variational Autoencoders (VAE)

Can also condition the decoder on other variables (conditional VAE)

Check out

  • Slides for loss derivation
  • Tutorial on Variational Autoencoders
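
For reference, a minimal sketch of the VAE training objective (reconstruction term plus KL term of the ELBO) in PyTorch; the layer sizes are illustrative, and a conditional VAE would simply concatenate the conditioning variable to the encoder input and to the latent before decoding.

```python
# Minimal VAE loss sketch (illustrative sizes, not tied to any specific lecture code).
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim = 784, 16
enc = nn.Linear(x_dim, 2 * z_dim)      # outputs mean and log-variance of q(z|x)
dec = nn.Linear(z_dim, x_dim)          # reconstructs x from a latent sample

def vae_loss(x: torch.Tensor) -> torch.Tensor:
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization trick
    recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return (recon + kl) / x.shape[0]

loss = vae_loss(torch.rand(32, x_dim))   # e.g., a batch of flattened images in [0, 1]
```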

5 of 22

Dreamer

Train on imagined trajectories!

  1. Learn a model of the environment (predict next state)
  2. Train the policy on trajectories imagined with the model
  3. Act in the environment to get more observations for step 1 (sketch below)
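
A compressed, illustrative sketch of that loop, assuming a toy MLP world model with deterministic latents and plain summed rewards in place of Dreamer's RSSM and lambda-returns; it only shows how the three steps fit together, not the actual Dreamer implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim, horizon = 8, 2, 16, 5

encoder = nn.Linear(obs_dim, latent_dim)                      # obs -> latent
world_model = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(),
                            nn.Linear(64, latent_dim))        # predicts next latent
reward_head = nn.Linear(latent_dim, 1)
actor = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Linear(latent_dim, 1)

model_opt = torch.optim.Adam([*encoder.parameters(), *world_model.parameters(),
                              *reward_head.parameters()], lr=3e-4)
ac_opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=3e-4)

def imagine(z):
    """Roll the learned model forward with the actor: no real environment steps."""
    rewards, values = [], []
    for _ in range(horizon):
        a = actor(z)
        z = world_model(torch.cat([z, a], dim=-1))
        rewards.append(reward_head(z))
        values.append(critic(z))
    return torch.stack(rewards), torch.stack(values)

# One illustrative iteration on a fake replay batch of real transitions.
obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
next_obs, rew = torch.randn(32, obs_dim), torch.randn(32, 1)

# Step 1: learn a model of the environment (predict next latent state and reward).
z, z_next = encoder(obs), encoder(next_obs).detach()
pred = world_model(torch.cat([z, act], dim=-1))
model_loss = ((pred - z_next) ** 2).mean() + ((reward_head(pred) - rew) ** 2).mean()
model_opt.zero_grad()
model_loss.backward()
model_opt.step()

# Step 2: train actor and critic entirely on imagined trajectories.
rewards, values = imagine(encoder(obs).detach())
returns = rewards.sum(0)                          # crude stand-in for lambda-returns
actor_loss = -returns.mean()                      # maximize imagined return
critic_loss = ((values[0] - returns.detach()) ** 2).mean()
ac_opt.zero_grad()
(actor_loss + critic_loss).backward()
ac_opt.step()

# Step 3: act in the real environment with the updated actor to collect more data
# for step 1 (environment interaction omitted in this sketch).
```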

6 of 22

Discrete variables better capture multi-modal distributions
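
As a small illustration of how such discrete latents are typically handled (a generic straight-through trick, not the course's exact code): sample a one-hot categorical code, which can spread probability mass over several separated modes, and pass gradients through the softmax probabilities.

```python
# Sketch of sampling a discrete (categorical) latent with straight-through gradients,
# in the spirit of DreamerV2-style models; the sizes here are arbitrary.
import torch
import torch.nn.functional as F

logits = torch.randn(32, 32, requires_grad=True)   # batch of 32, 32-way categorical
probs = F.softmax(logits, dim=-1)

# Sample a hard one-hot code; a categorical can place mass on several separated modes,
# unlike a unimodal Gaussian latent.
index = torch.multinomial(probs, num_samples=1)
one_hot = F.one_hot(index.squeeze(-1), num_classes=probs.shape[-1]).float()

# Straight-through estimator: the forward pass uses the hard sample,
# the backward pass uses the gradient of the probabilities.
sample = one_hot + probs - probs.detach()
```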

7 of 22

Intelligent Exploration

8 of 22

9 of 22

Exploration via modeling uncertainty of Q function

Model the Q-value distribution itself (difficult)
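
One common approximation, assumed here rather than taken from the slides, is a bootstrapped ensemble of Q-networks: disagreement between members stands in for uncertainty, and acting greedily with respect to a randomly sampled member gives Thompson-style deep exploration.

```python
# Illustrative ensemble-based approximation of Q-value uncertainty (assumption, not
# necessarily the exact method covered in lecture).
import random
import torch
import torch.nn as nn

obs_dim, n_actions, ensemble_size = 4, 3, 5
q_ensemble = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
              for _ in range(ensemble_size)]

def select_action(obs: torch.Tensor, head: nn.Module) -> int:
    """Greedy action under the sampled ensemble member."""
    with torch.no_grad():
        return int(head(obs).argmax().item())

# At the start of each episode, sample one head and follow it; disagreement between
# heads reflects epistemic uncertainty and drives temporally extended exploration.
head = random.choice(q_ensemble)
action = select_action(torch.randn(obs_dim), head)
```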

10 of 22

Exploration via modeling uncertainty of Q function

11 of 22

State counting

Map a state to a hash code, then count visits to states with that hash code. Encourage visiting states whose hash codes have low counts.

#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al.
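
A small sketch in the spirit of the SimHash variant from Tang et al.; the projection size, feature choice, and bonus scale below are illustrative assumptions.

```python
# Hash-based state counting: sign of a random projection gives the hash code,
# and rarely visited codes earn a larger exploration bonus.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
k, obs_dim, beta = 16, 8, 0.1
A = rng.normal(size=(k, obs_dim))          # fixed random projection
counts = defaultdict(int)

def hash_code(obs: np.ndarray) -> bytes:
    """SimHash: sign pattern of a random projection of the (featurized) observation."""
    return (A @ obs > 0).tobytes()

def exploration_bonus(obs: np.ndarray) -> float:
    """Count visits per hash code and reward low-count codes."""
    code = hash_code(obs)
    counts[code] += 1
    return beta / np.sqrt(counts[code])

r_int = exploration_bonus(rng.normal(size=obs_dim))
```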

12 of 22

Prediction error

  • Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al.
  • Large-Scale Study of Curiosity-Driven Learning, Burda et al.
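
A minimal sketch of the prediction-error bonus: a learned forward model predicts the next state's features, and the squared error is paid out as intrinsic reward. The inverse-dynamics feature learning from Pathak et al. is omitted, and the modules below are illustrative.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim, act_dim = 8, 16, 2
encoder = nn.Linear(obs_dim, feat_dim)                 # obs -> features
forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, feat_dim))

def intrinsic_reward(obs, action, next_obs):
    """Large forward-model prediction error => novel transition => high curiosity bonus."""
    with torch.no_grad():
        phi, phi_next = encoder(obs), encoder(next_obs)
        pred = forward_model(torch.cat([phi, action], dim=-1))
        return 0.5 * ((pred - phi_next) ** 2).sum(dim=-1)

r_int = intrinsic_reward(torch.randn(1, obs_dim), torch.randn(1, act_dim),
                         torch.randn(1, obs_dim))
```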

13 of 22

Reachability - Episodic Curiosity through Reachability
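
Roughly, the bonus is paid only when the current observation is judged not reachable within a few steps from anything in an episodic memory; the sketch below uses hypothetical stand-in modules (an untrained embedding and comparator) purely to show the control flow.

```python
import torch
import torch.nn as nn

obs_dim, emb_dim = 8, 16
embed = nn.Linear(obs_dim, emb_dim)
# In the paper, a comparator is trained to predict whether two observations are within
# k environment steps of each other; this one is an untrained stand-in.
comparator = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1), nn.Sigmoid())
memory: list[torch.Tensor] = []   # episodic memory of embeddings, reset each episode

def curiosity_bonus(obs: torch.Tensor, threshold: float = 0.5, bonus: float = 1.0) -> float:
    """Reward only states that are not easily reachable from anything in memory."""
    with torch.no_grad():
        e = embed(obs)
        reachable = any(comparator(torch.cat([e, m], dim=-1)).item() > threshold
                        for m in memory)
    if not reachable:
        memory.append(e)          # novel-enough state: remember it and pay the bonus
        return bonus
    return 0.0

r_int = curiosity_bonus(torch.randn(1, obs_dim))
```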

14 of 22

Go-Explore: a New Approach for Hard-Exploration Problems

Failures of intrinsic motivation stem from two issues: detachment and derailment.

Detachment: an agent driven by intrinsic motivation can become detached from the frontiers of high intrinsic reward (IR).

    • Once the IR in an area has been consumed, the agent may not remember how to get back to that location (catastrophic forgetting)
    • Go-Explore addresses detachment by explicitly storing an archive of promising states visited so that they can be revisited and explored from later

Derailment: the agent has discovered a promising state and it would be beneficial to return to that state and explore from it, but it fails to reliably do so.

    • IR causes agents to not want to return to those states to explore from there
    • To address derailment, the insight in Go-Explore is that effective exploration can be decomposed into first returning to a promising state (without intentionally adding any exploration) and only then exploring further

15 of 22

Go-Explore

  1. Phase 1
    1. (Deterministically) go to a state in the archive, then explore randomly from it; update the archive with the trajectory that reached each state -> replace an existing entry if the new trajectory got a higher score, or the same score with a shorter path (see the sketch after this list)
    2. Sparsify states by downsampling the image and use the downsampled version to decide which observations count as the "same state"
  2. Phase 2
    • Run IL on the best trajectories from Phase 1 to make the policy more "robust"
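
As referenced above, an illustrative sketch of the Phase 1 archive bookkeeping: cells come from a downsampled frame, and an entry is replaced only by a higher-scoring trajectory or an equally scoring but shorter one. The downsampling factors and data layout are assumptions, not the paper's exact settings.

```python
import numpy as np

archive: dict[bytes, dict] = {}   # cell -> {"traj": ..., "score": ..., "length": ...}

def cell_key(frame: np.ndarray) -> bytes:
    """Sparsify states: downsample and coarsely quantize the frame so that many
    similar observations map to the same cell."""
    small = frame[::8, ::8] // 32          # crude downsample + quantize
    return small.astype(np.uint8).tobytes()

def update_archive(frame, trajectory, score):
    """Keep the trajectory with the highest score; break ties by shorter length."""
    key = cell_key(frame)
    entry = archive.get(key)
    if (entry is None or score > entry["score"]
            or (score == entry["score"] and len(trajectory) < entry["length"])):
        archive[key] = {"traj": list(trajectory), "score": score,
                        "length": len(trajectory)}

# Phase 1 loop, conceptually: pick a cell from the archive, deterministically replay
# its trajectory to return there, explore randomly, and update the archive as above.
update_archive(np.random.randint(0, 255, size=(210, 160)), trajectory=[0, 2, 3], score=100)
```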

16 of 22

Learning Montezuma’s Revenge from a Single Demonstration

  • RL is very sample-inefficient, especially in sparse-reward settings (the agent may never reach the reward)

  • IL also requires many demonstrations to do well

  • This paper: learn from a single demo in a sparse-reward setting by starting episodes a small amount back from the reward along the demonstration, then iteratively moving the start point back until it reaches the initial state (see the sketch below)
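
A sketch of that backtracking curriculum under simple assumptions: `demo_states` stands in for emulator states saved along the demonstration, and `run_rl_from` is a placeholder for the actual RL training from a restored state.

```python
import random

# Saved emulator states along the single demonstration (stand-in values here).
demo_states = list(range(100))
start_idx = len(demo_states) - 5      # begin only a few steps before the sparse reward

def run_rl_from(state) -> bool:
    """Placeholder: run RL episodes starting from this restored demo state and report
    whether the agent now reliably reaches the reward from there."""
    return random.random() < 0.8

for _ in range(1000):                 # capped loop for this sketch
    if start_idx <= 0:
        break                         # the policy now solves the task from the true start
    if run_rl_from(demo_states[start_idx]):
        start_idx -= 5                # success: move the start point further back along the demo
```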

17 of 22

Sim2Real Transfer

18 of 22

Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World, Tobin et al.

19 of 22

Solving Rubik’s Cube with a Robot Hand

ADR (Automatic Domain Randomization): 1. gradually expand the range of training environments (a curriculum), 2. removes the need for manual domain randomization -> expansion is based on performance
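
A toy sketch of the ADR idea with made-up parameters and thresholds: each randomized simulator parameter keeps a range that widens whenever the policy performs well enough at the range boundary, so the curriculum expands automatically.

```python
import random

ranges = {"cube_size": [0.95, 1.05], "friction": [0.9, 1.1]}   # current randomization ranges
EXPAND_THRESHOLD, STEP = 0.8, 0.05

def sample_env_params() -> dict:
    """Each episode, sample simulator parameters uniformly from the current ranges."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}

def maybe_expand(param: str, boundary_success_rate: float) -> None:
    """Evaluate at a range boundary; widen the range once performance is good enough,
    giving an automatic curriculum with no hand-tuned randomization schedule."""
    if boundary_success_rate > EXPAND_THRESHOLD:
        ranges[param][0] -= STEP
        ranges[param][1] += STEP

params = sample_env_params()
maybe_expand("friction", boundary_success_rate=0.85)
```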

20 of 22

Driving Policy Transfer via Modularity and Abstraction

21 of 22

RMA: Rapid Motor Adaptation for Legged Robots

22 of 22

RMA: Rapid Motor Adaptation for Legged Robots