Reinforcement Learning
Wenhao Yu
Google DeepMind
How did we teach Pupper to walk in lab 4?
Go
Video games
Robotics
LLM
Why study RL?
What we will learn today
What is RL?
play Go
play StarCraft
control a robot
Markov Decision Process (MDP)
environment
agent/policy
state: e.g. piece positions on the board
action: e.g. place the next piece
reward: e.g. +1 point for winning the game
termination
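The pieces of an MDP can be sketched as a small interaction loop. This is a minimal illustration, not the setup used in the labs: `LineWorld`, `always_forward`, and `run_episode` are hypothetical names invented here, and the task (walk forward along a line until a goal is reached) is a toy assumption.

```python
class LineWorld:
    """Hypothetical toy environment: walk from position 0 to a goal on a line."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        # State: the agent's position on the line.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action: +1 (forward) or -1 (backward); position never drops below 0.
        self.position = max(0, self.position + action)
        self.steps += 1
        reached_goal = self.position == self.goal
        # Reward: +1 only when the goal is reached.
        reward = 1.0 if reached_goal else 0.0
        # Termination: goal reached or step budget used up.
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def always_forward(state):
    """A trivial policy: ignore the state and always step forward."""
    return +1


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    state = env.reset()
    total_reward, num_steps, done = 0.0, 0, False
    while not done:
        action = policy(state)                   # agent/policy picks an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        num_steps += 1
    return total_reward, num_steps


total_reward, num_steps = run_episode(LineWorld(), always_forward)
```

Swapping in a different policy function changes the behavior without touching the environment, which is the separation the MDP diagram draws between agent and environment.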
What RL tries to find
Quiz 1
If we want to create an AI that’s good at Super Mario
Quiz 2
We want to teach Pupper to walk forward
How to formulate an RL problem
policy/agent
reward
How to solve RL?
maximize reward over policy/agent
Optimization!
Objective function: reward (the metric we want to optimize)
Optimization argument: policy/agent (the thing we want to get)
[Figure: an example trajectory over time steps t=0, t=1, t=2, ..., t=T. The policy/agent chooses the action at each step: Go forward at t=0 (r = 0), Jump at t=1 (r = 100), with r = 5000 at t=T.]
Neural network parameters
How to solve this optimization problem? Hint: lab3
Gradient descent! (Since we maximize reward, the update is gradient ascent: step along the gradient instead of against it.)
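As a reminder of what a gradient-based update looks like when the goal is to maximize, here is a minimal sketch on a one-parameter toy problem. The objective `-(theta - 2)^2` is a hypothetical stand-in for the expected reward, chosen only because its gradient is easy to write down; a real policy would have many neural network parameters.

```python
def objective(theta):
    # Hypothetical stand-in for "expected reward"; its maximum is at theta = 2.
    return -(theta - 2.0) ** 2


def objective_gradient(theta):
    # Analytic derivative of the toy objective above.
    return -2.0 * (theta - 2.0)


def gradient_ascent(theta, learning_rate=0.1, num_steps=100):
    """Repeatedly step along the gradient to increase the objective."""
    for _ in range(num_steps):
        theta = theta + learning_rate * objective_gradient(theta)
    return theta


theta_star = gradient_ascent(theta=0.0)
```

Minimizing the negated objective with gradient descent gives exactly the same updates, so the two names describe the same procedure here.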
Why is this gradient not easy to compute?
Example: action Jump, r = 100
How does a change in an image pixel change the reward?
How does an image pixel change when we change the action slightly?
Stochastic policy comes to the rescue
Deterministic policy: jump
Stochastic policy: forward or jump
maximize
Sampled trajectories
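The difference between the two kinds of policy can be sketched in a few lines. This is an illustration under assumptions: the two-action set, the softmax parameterization, and the names `deterministic_policy` and `stochastic_policy` are choices made here, not something taken from the slides.

```python
import math
import random

ACTIONS = ["forward", "jump"]


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_policy(logits):
    # Always returns the single highest-scoring action.
    return ACTIONS[logits.index(max(logits))]


def stochastic_policy(logits, rng):
    # Samples an action; higher-scoring actions are more likely, not certain.
    probs = softmax(logits)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


logits = [0.0, 1.0]  # hypothetical policy outputs for one state
rng = random.Random(0)
samples = [stochastic_policy(logits, rng) for _ in range(1000)]
jump_fraction = samples.count("jump") / len(samples)
```

With a stochastic policy, the action probabilities change smoothly as the parameters change, so the expected reward can be estimated and improved from sampled trajectories without differentiating through the game itself.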
Policy Gradient
(How to make a trajectory more likely) x (Total reward in a trajectory)
Average over many trajectories
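The three labels match the terms of the standard REINFORCE estimator. The slide's own equation is not recoverable here, so the notation below is an assumption: $\pi_\theta$ is the stochastic policy with parameters $\theta$, $\tau^i$ is the $i$-th of $N$ sampled trajectories, and $R(\tau^i)$ is its total reward.

```latex
\nabla_\theta J(\theta)
  \approx
  \underbrace{\frac{1}{N} \sum_{i=1}^{N}}_{\text{average over many trajectories}}
  \underbrace{\left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \right)}_{\text{how to make a trajectory more likely}}
  \;
  \underbrace{R(\tau^i)}_{\text{total reward in a trajectory}}
```

Trajectories with high total reward push the parameters toward making their actions more likely; trajectories with low reward contribute little.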
Policy Gradient Algorithm
2. Compute gradient:
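A minimal sketch of the sample, compute-gradient, update cycle follows, assuming a toy task invented for illustration: each "trajectory" is a single step with two actions, and only action 1 is rewarded. The softmax policy, the learning rate, and all function names are assumptions, not the lab's implementation.

```python
import math
import random


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, rng):
    """One-step trajectory on a hypothetical task: action 1 earns reward 1."""
    probs = softmax(theta)
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    reward = 1.0 if action == 1 else 0.0
    return action, reward, probs


def policy_gradient_step(theta, rng, num_trajectories=50, learning_rate=0.5):
    grad = [0.0, 0.0]
    for _ in range(num_trajectories):
        action, reward, probs = sample_trajectory(theta, rng)
        for k in range(2):
            # Gradient of log pi(action) w.r.t. theta[k] for a softmax policy.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            # Weight by total reward and average over trajectories.
            grad[k] += grad_log_prob * reward / num_trajectories
    # Gradient ascent: move parameters in the direction that raises reward.
    return [t + learning_rate * g for t, g in zip(theta, grad)]


rng = random.Random(0)
theta = [0.0, 0.0]  # start with both actions equally likely
for _ in range(200):
    theta = policy_gradient_step(theta, rng)
prob_rewarded_action = softmax(theta)[1]
```

After training, the policy puts most of its probability on the rewarded action, even though the code never differentiates the reward function itself.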
Problem with simple Policy Gradient
What if …
Generalized Advantage Estimation (GAE)!
Trust Region Policy Optimization (TRPO)!
Proximal Policy Optimization (PPO)!
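The slides name GAE without giving a formula. As a rough sketch, the commonly used backward recursion replaces the total trajectory reward with a per-step advantage. The function name `compute_gae`, the default `gamma` and `lam` values, and the assumption that a learned value estimate is available for every state are all illustrative choices.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Per-step advantages from rewards and value estimates.

    `values` must have one more entry than `rewards`: the last entry is the
    value estimate of the state reached after the final step (0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better this step went than predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lam = 1 and zero value estimates, the advantage at each step
# reduces to the sum of rewards from that step onward.
reward_to_go = compute_gae([0.0, 100.0, 5000.0], [0.0] * 4, gamma=1.0, lam=1.0)
```

Lower values of `lam` rely more on the value estimates and less on the noisy sum of future rewards, trading variance for bias.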
What have we learned today?
A side note
Lab 5 preliminary
Sim2real and Accelerator-based sim
Zhuang et al., 2023
Cheng et al., 2023
~3 Billion control steps in training!
750 days!
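The two numbers can be related with simple arithmetic. The sketch below assumes that "750 days" is the real-time equivalent of the roughly 3 billion control steps; the slides do not state the control frequency, so the 50 Hz used in the second calculation is a hypothetical value for comparison only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

control_steps = 3e9  # from the slide: ~3 billion control steps
days = 750           # from the slide: 750 days

# Control frequency implied by the two slide numbers taken together.
implied_hz = control_steps / (days * SECONDS_PER_DAY)


def days_of_experience(steps, control_hz):
    """Real-time duration, in days, of `steps` control steps at `control_hz`."""
    return steps / control_hz / SECONDS_PER_DAY


# Hypothetical 50 Hz controller, for comparison with the slide's figure.
days_at_50hz = days_of_experience(control_steps, control_hz=50)
```

Under that assumption the implied rate is a bit above 46 control steps per second, and a 50 Hz controller would need a little under 700 days, the same order of magnitude as the slide's figure.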
Sim2real and Accelerator-based sim
PyBullet
IsaacGym
MuJoCo/MJX
DART
RaiSim
Sim2Real
https://sites.google.com/corp/view/nerf2real/home
Sim2Real – Domain Randomization
Train with different:
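Domain randomization can be sketched as drawing a fresh set of simulator parameters for every training episode. The list of randomized quantities from the slide is not recoverable here, so the parameter names and ranges below are hypothetical examples, and `sample_sim_params` is an illustrative helper rather than any simulator's API.

```python
import random

# Hypothetical parameters and ranges, chosen only for illustration.
RANDOMIZATION_RANGES = {
    "friction": (0.5, 1.25),
    "body_mass_scale": (0.8, 1.2),
    "motor_strength_scale": (0.9, 1.1),
    "control_latency_s": (0.0, 0.04),
}


def sample_sim_params(rng):
    """Draw one random simulator configuration, uniform within each range."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in RANDOMIZATION_RANGES.items()
    }


rng = random.Random(0)
# One configuration per training episode, so the policy never sees exactly
# the same simulated robot twice.
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

A policy that performs well across all sampled configurations is more likely to tolerate the mismatch between the simulator and the real robot.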
Sim2Real for Locomotion
2018: Sim-to-real: Learning agile locomotion for quadruped robots (0.5 sim days)
2019: Learning Agile and Dynamic Motor Skills for Legged Robots (9 sim days)
2021: Visual-Locomotion: Learning to Walk on Complex Terrains with Vision (120 sim days)
2023: Robot Parkour Learning (750 sim days)
Accelerator-based Parallel-simulation
CPU-based
GPU-based
500x faster