1 of 22

Quiz 3 Review 1

2 of 22

Part 1 of Quiz 3 Review

Lectures #18-22

  • A little more MBRL
  • Ideas for Intelligent Exploration
  • Sim2Real Transfer

⇒ Part 2 will cover the rest

3 of 22

Some more MBRL

4 of 22

Variational Autoencoders (VAE)

Can also condition the decoder on other variables (conditional VAE)

Check out

  • Slides for loss derivation
  • Tutorial on Variational Autoencoders
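
For reference, a minimal sketch of the VAE training objective (reconstruction term plus KL term of the ELBO) in PyTorch; the layer sizes are illustrative, and a conditional VAE would simply concatenate the conditioning variable to the encoder input and to the latent before decoding.

```python
# Minimal VAE loss sketch (illustrative sizes, not tied to any specific lecture code).
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim = 784, 16
enc = nn.Linear(x_dim, 2 * z_dim)      # outputs mean and log-variance of q(z|x)
dec = nn.Linear(z_dim, x_dim)          # reconstructs x from a latent sample

def vae_loss(x: torch.Tensor) -> torch.Tensor:
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization trick
    recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return (recon + kl) / x.shape[0]

loss = vae_loss(torch.rand(32, x_dim))   # e.g., a batch of flattened images in [0, 1]
```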

5 of 22

Dreamer

Train on imagined trajectories!

  1. Learn a model of the environment (predict next state)
  2. Train the policy on trajectories imagined with the model
  3. Act in the environment to get more observations for step 1 (sketch below)
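
A compressed, illustrative sketch of that loop, assuming a toy MLP world model with deterministic latents and plain summed rewards in place of Dreamer's RSSM and lambda-returns; it only shows how the three steps fit together, not the actual Dreamer implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim, horizon = 8, 2, 16, 5

encoder = nn.Linear(obs_dim, latent_dim)                      # obs -> latent
world_model = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(),
                            nn.Linear(64, latent_dim))        # predicts next latent
reward_head = nn.Linear(latent_dim, 1)
actor = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Linear(latent_dim, 1)

model_opt = torch.optim.Adam([*encoder.parameters(), *world_model.parameters(),
                              *reward_head.parameters()], lr=3e-4)
ac_opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=3e-4)

def imagine(z):
    """Roll the learned model forward with the actor: no real environment steps."""
    rewards, values = [], []
    for _ in range(horizon):
        a = actor(z)
        z = world_model(torch.cat([z, a], dim=-1))
        rewards.append(reward_head(z))
        values.append(critic(z))
    return torch.stack(rewards), torch.stack(values)

# One illustrative iteration on a fake replay batch of real transitions.
obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
next_obs, rew = torch.randn(32, obs_dim), torch.randn(32, 1)

# Step 1: learn a model of the environment (predict next latent state and reward).
z, z_next = encoder(obs), encoder(next_obs).detach()
pred = world_model(torch.cat([z, act], dim=-1))
model_loss = ((pred - z_next) ** 2).mean() + ((reward_head(pred) - rew) ** 2).mean()
model_opt.zero_grad()
model_loss.backward()
model_opt.step()

# Step 2: train actor and critic entirely on imagined trajectories.
rewards, values = imagine(encoder(obs).detach())
returns = rewards.sum(0)                          # crude stand-in for lambda-returns
actor_loss = -returns.mean()                      # maximize imagined return
critic_loss = ((values[0] - returns.detach()) ** 2).mean()
ac_opt.zero_grad()
(actor_loss + critic_loss).backward()
ac_opt.step()

# Step 3: act in the real environment with the updated actor to collect more data
# for step 1 (environment interaction omitted in this sketch).
```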

6 of 22

Discrete variables better capture multi-modal distributions
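
As a small illustration of how such discrete latents are typically handled (a generic straight-through trick, not the course's exact code): sample a one-hot categorical code, which can spread probability mass over several separated modes, and pass gradients through the softmax probabilities.

```python
# Sketch of sampling a discrete (categorical) latent with straight-through gradients,
# in the spirit of DreamerV2-style models; the sizes here are arbitrary.
import torch
import torch.nn.functional as F

logits = torch.randn(32, 32, requires_grad=True)   # batch of 32, 32-way categorical
probs = F.softmax(logits, dim=-1)

# Sample a hard one-hot code; a categorical can place mass on several separated modes,
# unlike a unimodal Gaussian latent.
index = torch.multinomial(probs, num_samples=1)
one_hot = F.one_hot(index.squeeze(-1), num_classes=probs.shape[-1]).float()

# Straight-through estimator: the forward pass uses the hard sample,
# the backward pass uses the gradient of the probabilities.
sample = one_hot + probs - probs.detach()
```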

7 of 22

Intelligent Exploration

8 of 22

9 of 22

Exploration via modeling uncertainty of Q function

Model the Q-value distribution itself (difficult)
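
One common approximation, assumed here rather than taken from the slides, is a bootstrapped ensemble of Q-networks: disagreement between members stands in for uncertainty, and acting greedily with respect to a randomly sampled member gives Thompson-style deep exploration.

```python
# Illustrative ensemble-based approximation of Q-value uncertainty (assumption, not
# necessarily the exact method covered in lecture).
import random
import torch
import torch.nn as nn

obs_dim, n_actions, ensemble_size = 4, 3, 5
q_ensemble = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
              for _ in range(ensemble_size)]

def select_action(obs: torch.Tensor, head: nn.Module) -> int:
    """Greedy action under the sampled ensemble member."""
    with torch.no_grad():
        return int(head(obs).argmax().item())

# At the start of each episode, sample one head and follow it; disagreement between
# heads reflects epistemic uncertainty and drives temporally extended exploration.
head = random.choice(q_ensemble)
action = select_action(torch.randn(obs_dim), head)
```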

10 of 22

Exploration via modeling uncertainty of Q function

11 of 22

State counting

Map a state to a hash code, then count visits to states with that hash code. Encourage visiting states whose hash codes have low counts.

#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al.
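
A small sketch in the spirit of the SimHash variant from Tang et al.; the projection size, feature choice, and bonus scale below are illustrative assumptions.

```python
# Hash-based state counting: sign of a random projection gives the hash code,
# and rarely visited codes earn a larger exploration bonus.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
k, obs_dim, beta = 16, 8, 0.1
A = rng.normal(size=(k, obs_dim))          # fixed random projection
counts = defaultdict(int)

def hash_code(obs: np.ndarray) -> bytes:
    """SimHash: sign pattern of a random projection of the (featurized) observation."""
    return (A @ obs > 0).tobytes()

def exploration_bonus(obs: np.ndarray) -> float:
    """Count visits per hash code and reward low-count codes."""
    code = hash_code(obs)
    counts[code] += 1
    return beta / np.sqrt(counts[code])

r_int = exploration_bonus(rng.normal(size=obs_dim))
```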

12 of 22

Prediction error

  • Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al.
  • Large-Scale Study of Curiosity-Driven Learning, Burda et al.
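
A minimal sketch of the prediction-error bonus: a learned forward model predicts the next state's features, and the squared error is paid out as intrinsic reward. The inverse-dynamics feature learning from Pathak et al. is omitted, and the modules below are illustrative.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim, act_dim = 8, 16, 2
encoder = nn.Linear(obs_dim, feat_dim)                 # obs -> features
forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, feat_dim))

def intrinsic_reward(obs, action, next_obs):
    """Large forward-model prediction error => novel transition => high curiosity bonus."""
    with torch.no_grad():
        phi, phi_next = encoder(obs), encoder(next_obs)
        pred = forward_model(torch.cat([phi, action], dim=-1))
        return 0.5 * ((pred - phi_next) ** 2).sum(dim=-1)

r_int = intrinsic_reward(torch.randn(1, obs_dim), torch.randn(1, act_dim),
                         torch.randn(1, obs_dim))
```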

13 of 22

Reachability - Episodic Curiosity through Reachability
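
Roughly, the bonus is paid only when the current observation is judged not reachable within a few steps from anything in an episodic memory; the sketch below uses hypothetical stand-in modules (an untrained embedding and comparator) purely to show the control flow.

```python
import torch
import torch.nn as nn

obs_dim, emb_dim = 8, 16
embed = nn.Linear(obs_dim, emb_dim)
# In the paper, a comparator is trained to predict whether two observations are within
# k environment steps of each other; this one is an untrained stand-in.
comparator = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1), nn.Sigmoid())
memory: list[torch.Tensor] = []   # episodic memory of embeddings, reset each episode

def curiosity_bonus(obs: torch.Tensor, threshold: float = 0.5, bonus: float = 1.0) -> float:
    """Reward only states that are not easily reachable from anything in memory."""
    with torch.no_grad():
        e = embed(obs)
        reachable = any(comparator(torch.cat([e, m], dim=-1)).item() > threshold
                        for m in memory)
    if not reachable:
        memory.append(e)          # novel-enough state: remember it and pay the bonus
        return bonus
    return 0.0

r_int = curiosity_bonus(torch.randn(1, obs_dim))
```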

14 of 22

Go-Explore: a New Approach for Hard-Exploration Problems

Failures of intrinsic motivation stem from two issues: detachment and derailment.

Detachment: an agent driven by intrinsic motivation can become detached from the frontiers of high intrinsic reward (IR).

    • Once the IR in an area has been consumed, the agent may not remember how to get back to that location (catastrophic forgetting)
    • Go-Explore addresses detachment by explicitly storing an archive of promising states visited so that they can be revisited and explored from later

Derailment: the agent has discovered a promising state and it would be beneficial to return to that state and explore from it, but it fails to reliably do so.

    • IR causes agents to not want to return to those states to explore from there
    • To address derailment, the insight in Go-Explore is that effective exploration can be decomposed into first returning to a promising state (without intentionally adding any exploration) and only then exploring further

15 of 22

Go-Explore

  1. Phase 1
    1. (Deterministically) go to a state in the archive, then explore randomly from it; update the archive with the trajectory that reached each state -> replace an existing entry if the new trajectory got a higher score, or the same score with a shorter path (see the sketch after this list)
    2. Sparsify states by downsampling the image and use the downsampled version to decide which observations count as the "same state"
  2. Phase 2
    • Run IL on the best trajectories from Phase 1 to make the policy more "robust"
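
As referenced above, an illustrative sketch of the Phase 1 archive bookkeeping: cells come from a downsampled frame, and an entry is replaced only by a higher-scoring trajectory or an equally scoring but shorter one. The downsampling factors and data layout are assumptions, not the paper's exact settings.

```python
import numpy as np

archive: dict[bytes, dict] = {}   # cell -> {"traj": ..., "score": ..., "length": ...}

def cell_key(frame: np.ndarray) -> bytes:
    """Sparsify states: downsample and coarsely quantize the frame so that many
    similar observations map to the same cell."""
    small = frame[::8, ::8] // 32          # crude downsample + quantize
    return small.astype(np.uint8).tobytes()

def update_archive(frame, trajectory, score):
    """Keep the trajectory with the highest score; break ties by shorter length."""
    key = cell_key(frame)
    entry = archive.get(key)
    if (entry is None or score > entry["score"]
            or (score == entry["score"] and len(trajectory) < entry["length"])):
        archive[key] = {"traj": list(trajectory), "score": score,
                        "length": len(trajectory)}

# Phase 1 loop, conceptually: pick a cell from the archive, deterministically replay
# its trajectory to return there, explore randomly, and update the archive as above.
update_archive(np.random.randint(0, 255, size=(210, 160)), trajectory=[0, 2, 3], score=100)
```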

16 of 22

Learning Montezuma’s Revenge from a Single Demonstration

  • RL is very sample-inefficient, especially in sparse-reward settings (the agent may never reach the reward)

  • IL also requires many demonstrations to do well

  • This paper: learn from a single demo in a sparse-reward setting by starting episodes a small amount back from the reward along the demonstration, then iteratively moving the start point back until it reaches the initial state (see the sketch below)
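
A sketch of that backtracking curriculum under simple assumptions: `demo_states` stands in for emulator states saved along the demonstration, and `run_rl_from` is a placeholder for the actual RL training from a restored state.

```python
import random

# Saved emulator states along the single demonstration (stand-in values here).
demo_states = list(range(100))
start_idx = len(demo_states) - 5      # begin only a few steps before the sparse reward

def run_rl_from(state) -> bool:
    """Placeholder: run RL episodes starting from this restored demo state and report
    whether the agent now reliably reaches the reward from there."""
    return random.random() < 0.8

for _ in range(1000):                 # capped loop for this sketch
    if start_idx <= 0:
        break                         # the policy now solves the task from the true start
    if run_rl_from(demo_states[start_idx]):
        start_idx -= 5                # success: move the start point further back along the demo
```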

17 of 22

Sim2Real Transfer

18 of 22

Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World, Tobin et al.

19 of 22

Solving Rubik’s Cube with a Robot Hand

ADR (Automatic Domain Randomization): 1. gradually expand the range of training environments (a curriculum), 2. removes the need for manual domain randomization -> expansion is based on performance
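
A toy sketch of the ADR idea with made-up parameters and thresholds: each randomized simulator parameter keeps a range that widens whenever the policy performs well enough at the range boundary, so the curriculum expands automatically.

```python
import random

ranges = {"cube_size": [0.95, 1.05], "friction": [0.9, 1.1]}   # current randomization ranges
EXPAND_THRESHOLD, STEP = 0.8, 0.05

def sample_env_params() -> dict:
    """Each episode, sample simulator parameters uniformly from the current ranges."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}

def maybe_expand(param: str, boundary_success_rate: float) -> None:
    """Evaluate at a range boundary; widen the range once performance is good enough,
    giving an automatic curriculum with no hand-tuned randomization schedule."""
    if boundary_success_rate > EXPAND_THRESHOLD:
        ranges[param][0] -= STEP
        ranges[param][1] += STEP

params = sample_env_params()
maybe_expand("friction", boundary_success_rate=0.85)
```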

20 of 22

Driving Policy Transfer via Modularity and Abstraction

21 of 22

RMA: Rapid Motor Adaptation for Legged Robots

22 of 22

RMA: Rapid Motor Adaptation for Legged Robots