1 of 87

Reinforcement Learning

Wenhao Yu

Google DeepMind

2 of 87

Reinforcement Learning

How did we teach Pupper to walk in lab 4?

3 of 87

Go

4 of 87

Video games

5 of 87

Robotics

6 of 87

LLM

8 of 87

Why study RL?

  • Core technique behind many modern ML breakthroughs
  • Onward to create more breakthroughs

12 of 87

What we will learn today

  • What is RL?
  • How to formulate an RL problem
  • How to solve an RL problem (Make our cute dog robot move!)

15 of 87

What is RL?

  • What’s common among these examples?
    • A task being performed
    • An agent/policy that gets the current state of the world
    • The agent needs to decide what to do next
    • No clear answer for individual actions, but a reward for overall correctness

play Go

play StarCraft

control robot

18 of 87

Markov Decision Process (MDP)

  • environment
  • agent/policy: what RL tries to find
  • state: e.g. piece positions on the board
  • action: e.g. place the next piece
  • reward: e.g. +1 point for winning the game
  • termination

26 of 87

Quiz 1

If we want to create an AI that’s good at Super Mario

  • What’s the environment?
  • What’s the agent observing?
  • What actions can the agent take?
  • What’s the reward?
  • When is the trajectory terminated?

27 of 87

Quiz 2

We want to teach Pupper to walk forward

  • What’s the environment?
  • What’s the observation?
  • What actions can it take?
  • What might be the reward?
  • When do we terminate the trajectory?
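The quiz answers can be made concrete as code. Below is a minimal sketch, assuming a hypothetical one-dimensional "walk forward" task. `ToyWalkEnv`, its reward, and its termination rule are illustrative names and choices, not the Pupper lab environment.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyWalkEnv:
    """Hypothetical 1-D 'walk forward' task (not the Pupper simulator)."""
    goal: float = 5.0     # distance that counts as success
    max_steps: int = 50   # time limit that also ends the trajectory
    x: float = 0.0        # current position
    t: int = 0            # current time step

    def reset(self) -> float:
        self.x, self.t = 0.0, 0
        return self.x  # state / observation

    def step(self, action: float):
        action = max(-1.0, min(1.0, action))  # action: commanded step size
        self.x += action
        self.t += 1
        reward = action  # reward: forward progress made in this step
        done = self.x >= self.goal or self.t >= self.max_steps  # termination
        return self.x, reward, done


def rollout(env, policy):
    """Run one trajectory and return its total reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


random_policy = lambda state: random.uniform(-1.0, 1.0)
forward_policy = lambda state: 1.0
```

Calling `rollout(ToyWalkEnv(), forward_policy)` runs the agent/environment loop until termination. A real locomotion task would use richer observations (joint angles, IMU readings) and a more carefully shaped reward.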

28 of 87

How to formulate an RL problem

  • Given an MDP, RL aims to find the smartest policy/agent that gets the most reward
  • But… how?

30 of 87

How to solve RL?

Optimization!

  • maximize the reward over the policy/agent
  • Objective function (the metric we want to optimize): reward
  • Optimization argument (the thing we want to get): policy/agent

Example trajectory:

  • Time steps: t=0, t=1, t=2, …, t=T
  • Actions: Go forward, Jump, …
  • Rewards: r = 0, r = 100, …, r = 5000

44 of 87

How to solve RL?

maximize

Optimization!

 

 

Neural network parameters
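One common way to write this objective is sketched here. It assumes a deterministic policy $\pi_\theta$ with neural network parameters $\theta$, a reward function $r$, and environment dynamics $f$. These symbols are assumptions chosen for illustration and may differ from the notation used on the slide.

```latex
\max_{\theta} \; J(\theta) = \sum_{t=0}^{T} r(s_t, a_t),
\qquad a_t = \pi_\theta(s_t),
\qquad s_{t+1} = f(s_t, a_t)
```

The optimization argument is $\theta$: changing the network parameters changes every action, and through the dynamics every later state and reward.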

 

 

 

45 of 87

How to solve RL?

maximize

Optimization!

 

 

How to solve this optimization problem? Hint: lab3

46 of 87

How to solve RL?

maximize

Optimization!

 

 

Gradient descent!
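Because the goal here is to maximize, the update moves along the gradient (gradient ascent), which is the same as gradient descent on the negated objective. Below is a minimal sketch on a hypothetical one-parameter objective `J`, using a finite-difference gradient in place of backpropagation.

```python
def grad(f, theta, eps=1e-5):
    """Central finite-difference estimate of df/dtheta for a scalar theta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


def gradient_ascent(f, theta, lr=0.1, steps=200):
    """Repeatedly step in the direction that increases f."""
    for _ in range(steps):
        theta += lr * grad(f, theta)
    return theta


# Hypothetical objective with its maximum at theta = 3.
J = lambda theta: -(theta - 3.0) ** 2

theta_star = gradient_ascent(J, theta=0.0)
```

In RL the parameter is a whole vector of network weights and the objective is the total reward, but the update rule has the same shape.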

57 of 87

Why is this gradient not easy to compute?

 

  1. Requires reward to be differentiable
  2. Requires dynamics to be differentiable
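Both requirements can be read off the chain rule. The sketch below assumes a deterministic policy $a_t = \pi_\theta(s_t)$, dynamics $s_{t+1} = f(s_t, a_t)$, and objective $J(\theta) = \sum_{t} r(s_t, a_t)$. The notation is an assumption for illustration.

```latex
\frac{dJ}{d\theta} = \sum_{t=0}^{T}\left(
  \frac{\partial r}{\partial s_t}\frac{d s_t}{d\theta}
  + \frac{\partial r}{\partial a_t}\frac{d a_t}{d\theta}\right),
\qquad
\frac{d a_t}{d\theta} =
  \frac{\partial \pi_\theta}{\partial \theta}
  + \frac{\partial \pi_\theta}{\partial s_t}\frac{d s_t}{d\theta},
\qquad
\frac{d s_{t+1}}{d\theta} =
  \frac{\partial f}{\partial s_t}\frac{d s_t}{d\theta}
  + \frac{\partial f}{\partial a_t}\frac{d a_t}{d\theta}
```

The terms $\partial r/\partial s_t$ and $\partial r/\partial a_t$ need a differentiable reward, and $\partial f/\partial s_t$ and $\partial f/\partial a_t$ need differentiable dynamics.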

 

58 of 87

Why is this gradient not easy to compute?

r=100

Jump

How does a pixel change in the image change the reward?

How does a pixel in the image change when we change the action slightly?

 

59 of 87

Stochastic Policy comes to the rescue

 

 

Deterministic policy:

Stochastic policy:

jump

 

forward

 

jump
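The difference between the two policy types can be sketched in a few lines. In this minimal example the `scores` list stands in for hypothetical policy-network outputs for one state, and the softmax parameterization is an assumption.

```python
import math
import random

ACTIONS = ["forward", "jump"]


def deterministic_policy(scores):
    """Always return the single highest-scoring action."""
    best = max(range(len(ACTIONS)), key=lambda i: scores[i])
    return ACTIONS[best]


def action_probs(scores):
    """Softmax: turn scores into a probability for each action."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def stochastic_policy(scores, rng=random):
    """Sample an action according to its probability."""
    return rng.choices(ACTIONS, weights=action_probs(scores), k=1)[0]


scores = [0.0, 1.0]  # hypothetical network outputs for one state
```

With these scores the deterministic policy always returns `"jump"`, while the stochastic policy returns `"jump"` most of the time and `"forward"` otherwise. The action probabilities change smoothly with the scores, which is what the policy gradient later relies on.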

61 of 87

Stochastic Policy comes to the rescue

maximize

 

 

maximize

 

 

Deterministic policy:

Stochastic policy:

Sampled trajectories
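With a stochastic policy the objective becomes an expectation over trajectories, which can be estimated from samples. A sketch in assumed notation, where $\tau$ is a trajectory, $p_\theta(\tau)$ is its probability under the policy, and $R(\tau)$ is its total reward:

```latex
\max_{\theta} \; J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]
  \approx \frac{1}{N}\sum_{i=1}^{N} R(\tau_i),
\qquad R(\tau) = \sum_{t=0}^{T} r(s_t, a_t),
\qquad a_t \sim \pi_\theta(\cdot \mid s_t)
```

Here $\tau_1, \dots, \tau_N$ are the sampled trajectories obtained by running the current policy.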

65 of 87

Policy Gradient

 

How to make a trajectory more likely × total reward in a trajectory

Average over many trajectories
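The three labels match the standard policy gradient estimator. Written in assumed notation ($\tau_i$ is the $i$-th sampled trajectory and $R(\tau_i)$ its total reward):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]
  \approx \frac{1}{N}\sum_{i=1}^{N}
    \underbrace{\Big(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t^{i}\mid s_t^{i})\Big)}_{\text{how to make a trajectory more likely}}
    \;\underbrace{R(\tau_i)}_{\text{total reward in a trajectory}}
```

Only the policy terms appear inside the gradient because the environment dynamics do not depend on $\theta$, so neither the reward nor the dynamics has to be differentiable.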

66 of 87

Policy Gradient Algorithm

 

2. Compute gradient:
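A minimal sketch of the full loop (sample trajectories, compute the gradient, take an ascent step) on a hypothetical toy task. The task, the two-parameter softmax policy, and all hyperparameters are assumptions chosen to keep the example short.

```python
import math
import random


def probs(theta):
    """Softmax policy over two actions (the same in every state)."""
    m = max(theta)
    e = [math.exp(x - m) for x in theta]
    z = sum(e)
    return [x / z for x in e]


def sample_trajectory(theta, rng, horizon=5):
    """Toy task (an assumption): action 1 earns reward 1, action 0 earns 0."""
    p = probs(theta)
    actions = [rng.choices([0, 1], weights=p)[0] for _ in range(horizon)]
    return actions, float(sum(actions))


def policy_gradient(theta, trajectories):
    """Average over trajectories of grad log p(trajectory) * total reward."""
    p = probs(theta)
    grad = [0.0, 0.0]
    for actions, total_reward in trajectories:
        for a in actions:
            for k in range(2):
                # d/d theta_k of log softmax(theta)[a] = 1[a == k] - p[k]
                grad[k] += ((1.0 if a == k else 0.0) - p[k]) * total_reward
    return [g / len(trajectories) for g in grad]


def train(iterations=300, batch=50, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        # 1. Sample trajectories with the current policy.
        trajs = [sample_trajectory(theta, rng) for _ in range(batch)]
        # 2. Compute the gradient estimate.
        grad = policy_gradient(theta, trajs)
        # 3. Take a gradient ascent step.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


theta = train()
```

After training, `probs(theta)` should put most of its mass on action 1, the rewarded action. Each batch is thrown away after a single update, which is the sample inefficiency of this simple version.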

 

70 of 87

Problems with simple Policy Gradient

  1. Requires a lot of samples to get a good gradient.
  2. Old samples become useless once the policy changes.

What if …

  1. we can use fewer samples to get a good gradient?
  2. we can re-use old samples from previous policies?

Generalized Advantage Estimator (GAE)!

Trust-Region Policy Optimization (TRPO)!

Proximal Policy Optimization (PPO)!
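As one illustration of how these methods reduce the noise in the gradient, GAE replaces the total trajectory reward with a per-step advantage estimate built from a learned value function. Below is a minimal sketch of the standard GAE recursion. The function name, argument layout, and default `gamma` and `lam` values are assumptions, and TRPO and PPO are not shown.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards; the last entry
             is the value after the final step (0.0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: how much better step t went than predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `gamma=1`, `lam=1`, and all values set to zero, the advantages reduce to the plain reward-to-go. Lower `lam` leans more on the value function and gives a lower-variance but more biased estimate.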

71 of 87

What have we learned today?

  • What is RL
      • Use Markov Decision Process (MDP) to formulate RL problems
  • Solve the formulated RL problem using optimization
      • Why it is difficult to solve using a deterministic policy
      • Policy Gradient algorithm using stochastic policy

72 of 87

A side note

  • There are many good ways to interpret and understand RL.
  • The best way to understand it is to learn it from different perspectives.
  • We learned one of them today, which hopefully complements other materials.
  • Other useful resources:

73 of 87

Lab 5 preliminary

  • Sim2real and Accelerator-based sim

75 of 87

Sim2real and Accelerator-based sim

Zhuang et al., 2023

Cheng et al., 2023

~3 billion control steps in training!

750 days of simulated time!

76 of 87

Sim2real and Accelerator-based sim

PyBullet

IsaacGym

MuJoCo/MJX

DART

RaiSim

80 of 87

Sim2Real

  • Sim2Real gap:
      • Images look different
      • Dynamics are different

82 of 87

Sim2Real

  • Sim2Real gap:
      • Can we make simulation the same as the real world?
          • Yes, to some extent.

https://sites.google.com/corp/view/nerf2real/home

83 of 87

Sim2Real – Domain Randomization

Train with different:

  • Mass
  • Friction
  • Motor strengths
  • Latency
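Domain randomization can be sketched as drawing a fresh set of physical parameters for each training episode. In the minimal example below, the `SimParams` fields mirror the list above, while the ranges and names are hypothetical and not taken from any particular simulator or from the lab code.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    mass_scale: float            # multiplier on the nominal body mass
    friction: float              # ground friction coefficient
    motor_strength_scale: float  # multiplier on the nominal motor torque
    latency_s: float             # delay between command and actuation


# Hypothetical ranges; real values depend on the robot and simulator.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "friction": (0.5, 1.25),
    "motor_strength_scale": (0.8, 1.2),
    "latency_s": (0.0, 0.04),
}


def sample_sim_params(rng):
    """Draw one randomized set of simulation parameters."""
    return SimParams(**{name: rng.uniform(lo, hi)
                        for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

Each episode would then be simulated with its own `SimParams`, so the policy cannot overfit to a single set of dynamics and is more likely to cope with the real robot.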

86 of 87

Sim2Real for Locomotion

  • 2018: Sim-to-real: Learning agile locomotion for quadruped robots (0.5 sim days)
  • 2019: Learning Agile and Dynamic Motor Skills for Legged Robots (9 sim days)
  • 2021: Visual-Locomotion: Learning to Walk on Complex Terrains with Vision (120 sim days)
  • 2023: Robot Parkour Learning (750 sim days)

87 of 87

Accelerator-based Parallel-simulation

CPU-based

GPU-based

500x faster
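A rough back-of-the-envelope check of what these numbers imply, using the figures quoted earlier in the deck (about 3 billion control steps, 750 simulated days). It assumes the 500x speedup applies directly to that training run, which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

control_steps = 3e9  # "~3 billion control steps in training"
sim_days = 750       # simulated time quoted for the same training run
speedup = 500        # "500x faster" GPU-based simulation (assumed to apply here)

# Control rate implied by the two quoted figures, in steps per simulated second.
implied_control_hz = control_steps / (sim_days * SECONDS_PER_DAY)

# Wall-clock time if simulation ran 500x faster than real time.
wall_clock_days = sim_days / speedup
```

Under these assumptions the implied control rate is roughly 46 steps per second, and 750 simulated days shrink to about 1.5 days of wall-clock time.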