1 of 87

Reinforcement Learning

Wenhao Yu

Google DeepMind

2 of 87

Reinforcement Learning

How did we teach Pupper to walk in lab 4?

3 of 87

Go

4 of 87

Video games

5 of 87

Robotics

6 of 87

LLM

7 of 87

Why study RL?

Core technique behind many modern ML breakthroughs
Onward to create more breakthroughs

9 of 87

What we will learn today

What is RL?
How to formulate an RL problem
How to solve an RL problem (Make our cute dog robot move!)

13 of 87

What is RL?

What’s common among these examples?

A task being performed
An agent/policy that gets the current state of the world
The agent needs to make decisions on what to do next
No clear answer for individual actions, but a reward for overall correctness

play Go

play StarCraft

control a robot

18 of 87

Markov Decision Process (MDP)

environment
agent/policy (what RL tries to find)
state, e.g. piece positions on the board
action, e.g. place the next piece
reward, e.g. +1 point if the agent wins the game
termination
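
The pieces of the MDP fit together as an interaction loop: the agent/policy reads the state, picks an action, and the environment answers with a reward, the next state, and a termination signal. The following is a minimal Python sketch of that loop, not the lab code. The `LineWorld` environment, the `always_right` policy, and the `run_episode` helper are hypothetical names invented for illustration.

```python
from typing import Callable, Tuple


class LineWorld:
    """Hypothetical toy environment: start at position 0, try to reach `goal`."""

    def __init__(self, goal: int = 3, max_steps: int = 10) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self) -> int:
        """Start a new trajectory and return the initial state."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply an action (+1 or -1); return (state, reward, terminated)."""
        self.position += action
        self.steps += 1
        reached = self.position == self.goal
        reward = 1.0 if reached else 0.0  # reward only for reaching the goal
        terminated = reached or self.steps >= self.max_steps
        return self.position, reward, terminated


def always_right(state: int) -> int:
    """A fixed, hand-written policy: map any state to the action +1."""
    return 1


def run_episode(env: LineWorld, policy: Callable[[int], int]) -> float:
    """Run one trajectory until termination and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    terminated = False
    while not terminated:
        action = policy(state)                         # agent decides
        state, reward, terminated = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


episode_reward = run_episode(LineWorld(), always_right)
print(episode_reward)
```

Swapping `always_right` for a different function changes the total reward while the environment stays the same, which is the sense in which the policy is the thing RL searches over.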

26 of 87

Quiz 1

If we want to create an AI that’s good at Super Mario

What’s the environment?
What’s the agent observing?
What actions can the agent take?
What’s the reward?
When is the trajectory terminated?

27 of 87

Quiz 2

We want to teach Pupper to walk forward

What’s the environment?
What’s the observation?
What actions can it take?
What might be the reward?
When do we terminate the trajectory?

28 of 87

How to formulate an RL problem

Given an MDP, RL aims to find the smartest policy/agent that gets the most reward.
But… how?

30 of 87

How to solve RL?

Optimization!

$$\max_{\text{policy/agent}} \ \text{reward}$$

Objective function (the metric we want to optimize): reward
Optimization argument (the thing we want to get): policy/agent

33 of 87

How to solve RL?

Optimization!

$$\max_{\text{policy/agent}} \ \text{reward}$$

Example trajectory: t=0 → Go forward (r=0) → t=1 → Jump (r=100) → t=2 → … (r=5000) → t=T
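
One common way to turn the per-step rewards of a trajectory into a single number to maximize is to add them up. The sketch below does that for a shortened, hypothetical version of the trajectory above. The steps between t=2 and t=T are elided on the slide and are not filled in here, so the third entry is only a placeholder for the final transition, and `trajectory_return` is an illustrative name.

```python
from typing import List, Tuple

# Hypothetical, shortened trajectory as (action, reward) pairs.
# The last entry stands in for the final transition into t=T; its action
# label is a placeholder because the slide does not name it.
trajectory: List[Tuple[str, float]] = [
    ("go forward", 0.0),
    ("jump", 100.0),
    ("final step", 5000.0),
]


def trajectory_return(steps: List[Tuple[str, float]]) -> float:
    """Sum the rewards collected along one trajectory."""
    return sum(reward for _, reward in steps)


total = trajectory_return(trajectory)
print(total)
```

A different policy would produce a different sequence of actions and rewards, and therefore a different total, which is what makes the total a usable objective.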

39 of 87

How to solve RL?

Optimization!

[Figure: the same trajectory, with the policy/agent choosing the action at each step: Go forward at t=0, Jump at t=1, and so on until t=T.]

44 of 87

How to solve RL?

Optimization!

$$\max_{\text{neural network parameters}}$$

How to solve this optimization problem? Hint: lab3

Gradient descent! (Because we are maximizing, each step moves along the gradient of the objective, i.e. gradient ascent, which is the same as gradient descent on the negated objective.)
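
To make the gradient step concrete, here is a minimal sketch under strong simplifying assumptions: the "policy" has a single parameter `theta` instead of a neural network, there are only two actions with hypothetical rewards 0 and 1, and the gradient of the expected reward is computed exactly rather than estimated from sampled trajectories. All names (`REWARDS`, `prob_action_1`, `gradient_ascent`) are illustrative.

```python
import math

# Hypothetical rewards for action 0 and action 1.
REWARDS = (0.0, 1.0)


def prob_action_1(theta: float) -> float:
    """Policy: probability of choosing action 1 (sigmoid of the parameter)."""
    return 1.0 / (1.0 + math.exp(-theta))


def expected_reward(theta: float) -> float:
    """Objective: expected reward under the policy with parameter theta."""
    p1 = prob_action_1(theta)
    return (1.0 - p1) * REWARDS[0] + p1 * REWARDS[1]


def gradient(theta: float) -> float:
    """Exact derivative of expected_reward with respect to theta."""
    p1 = prob_action_1(theta)
    # The derivative of the sigmoid is p1 * (1 - p1).
    return (REWARDS[1] - REWARDS[0]) * p1 * (1.0 - p1)


def gradient_ascent(theta: float = 0.0, learning_rate: float = 1.0,
                    steps: int = 200) -> float:
    """Repeatedly move theta in the direction that increases the objective."""
    for _ in range(steps):
        theta += learning_rate * gradient(theta)  # "+" because we maximize
    return theta


theta_final = gradient_ascent()
print(expected_reward(0.0), expected_reward(theta_final))
```

Starting from `theta = 0.0` the policy picks each action with equal probability; the updates push `theta` upward so the higher-reward action becomes increasingly likely. With a neural network policy the same update is applied to every parameter, and the gradient typically has to be estimated from trajectories.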

52 of 87