Dynamic Programming for Reinforcement Learning
Lecture 6
9/21/20
Administrivia
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems
Puzzle Solving in Groups of 3
When you're done, click "YES" in the Zoom window:
| | |
| | |
| | done |
In each cell, write the number of steps to reach the done state.
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | done | | |
In each cell, write the number of steps to reach the done state.
How would you write an algorithm to solve this puzzle?
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | done | | |
In each cell, draw an arrow indicating which direction to move to reach the done state fastest.
Then, write the number of steps to reach the goal in each state.
How would you write an algorithm to solve this puzzle?
| | |
| | |
| | done |
(OPTIONAL)
In this example, there are multiple arrows leaving each cell. The agent randomly chooses one of them. In each cell, write the expected number of steps to reach the done state.
STOP
Shortest Path Algorithms
Define d(s, a) = the shortest path distance to the done state from state s, if you start by taking action a.
Two properties of shortest path distances:
1. At the done state the distance is zero: d(done, a) = 0 for every action a.
2. Everywhere else, taking one step costs 1 and then you follow the best path from the next state: d(s, a) = 1 + min_{a'} d(s', a'), where s' is the state reached by taking action a in s.
| | |
| | |
| | done |
A Shortest Path Algorithm
Initialize d(s, a) ← 0 for all s, a
While not converged:
    For all s, a:
        d(s, a) ← 1 + min_{a'} d(s', a'), where s' is the state reached by taking action a in s (keeping d(done, a) = 0)
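As a concrete sketch, here is what this algorithm might look like in Python on an assumed 3x3 grid with the "done" cell in the bottom-right corner; the grid layout, action set, and all names are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: shortest-path dynamic programming on an assumed 3x3 grid.
ROWS, COLS = 3, 3
GOAL = (2, 2)                                   # the "done" cell (assumed location)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic grid move; bumping into a wall keeps you in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

states = [(r, c) for r in range(ROWS) for c in range(COLS)]

# Initialize d(s, a) = 0 for all s, a.
d = {(s, a): 0 for s in states for a in ACTIONS}

# While not converged: for all s, a, apply d(s, a) <- 1 + min_{a'} d(s', a').
changed = True
while changed:
    changed = False
    for s in states:
        if s == GOAL:
            continue                            # the done state keeps distance 0
        for a in ACTIONS:
            s_next = step(s, a)
            new = 1 + min(d[(s_next, a2)] for a2 in ACTIONS)
            if new != d[(s, a)]:
                d[(s, a)], changed = new, True

# Steps-to-go from each cell = distance under the best first action.
print({s: min(d[(s, a)] for a in ACTIONS) for s in states})
```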
Shortest Path == Reinforcement Learning
If the reward is -1 for every step (and 0 once you reach the done state), then maximizing the total reward is equivalent to minimizing the number of steps.
But what if we are using a different reward function?
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems
Side note: reward function design is really hard
Learning by Playing – Solving Sparse Reward Tasks from Scratch, Riedmiller et al 2018
Puzzle Solving in Groups of 3
When you're done, click "YES" in the Zoom window:
In each cell, write the expected future total reward.
| | |
| | |
| | done |
[Figure: the grid above with reward labels +2, +3, +1, -3, -5, +1]
| -1 | -1 |   0  |   0 | -1 |
|  0 |    |  -3  |     |  0 |
| -1 |    |  -2  |     | -1 |
|  0 | -1 |   0  | -10 |  0 |
|    |    |      |     | -1 |
| -1 | -2 | done |  -2 |  0 |
In each cell, draw an arrow indicating which direction to move to achieve the largest total reward.
Then, write the expected future total reward in each state.
How would you write an algorithm to solve this puzzle?
In each cell, write the expected future total reward.
How would you write an algorithm to solve this puzzle?
| | |
| | |
| | done |
[Figure: the grid above with reward labels -1, -3, -2, -2, -3, -5]
STOP
What is a Q-function?
Answer: Expected total future reward, starting from a given state and action (and then following the policy)
Intuition: Q functions capture information about how good each state and action are for achieving the agent's goal
What is a value function?
Answer: The expected Q-value under the policy's choice of action: V(s) = E_{a ~ π}[ Q(s, a) ].
Intuition: value functions capture how good each state is, averaging over the actions the policy would take.
Unrolling one step
"Total reward" = "reward now" + "reward later"
Learning the Q-function
Initialize Q(s, a) ← 0 for all states and actions
While not converged:
    For all states s and actions a:
        Q(s, a) ← r(s, a) + γ E_{s' ~ p(·|s, a)}[ V(s') ],  where V(s') = E_{a' ~ π}[ Q(s', a') ]
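A minimal Python sketch of this loop for a tabular problem; the array shapes, the names P, R, and pi, and the discount gamma are assumptions for illustration, not something specified in the slides:

```python
import numpy as np

def evaluate_q(P, R, pi, gamma=0.99, tol=1e-8):
    """Tabular policy evaluation (assumed shapes, for illustration):
       P[s, a, s'] = transition probabilities, R[s, a] = rewards,
       pi[s, a]    = probability the policy takes action a in state s."""
    S, A = R.shape
    Q = np.zeros((S, A))                       # Initialize
    while True:                                # While not converged
        V = (pi * Q).sum(axis=1)               # V(s) = E_{a ~ pi}[Q(s, a)]
        Q_new = R + gamma * P @ V              # Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```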
How do we know when we're done?
Answer: Use the Bellman equations: Q has converged when Q(s, a) = r(s, a) + γ E_{s'}[ V(s') ] holds for every state and action.
Another version, written entirely in terms of Q: Q(s, a) = r(s, a) + γ E_{s'}[ E_{a' ~ π}[ Q(s', a') ] ]
Alternative approach to learning Q(s, a)
Approach 1: Dynamic programming: iterate the Bellman update until Q stops changing.
Approach 2:
Observation: for a fixed policy, the Bellman equations are a set of linear equations, one per state-action pair:
Q(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) Σ_{a'} π(a' | s') Q(s', a')
Alternative approach to learning Q(s, a)
Approach 1: Dynamic programming: iterate the Bellman update until Q stops changing.
Approach 2: View the Bellman equations as a set of linear equations in the unknowns Q(s, a):
Q = R + γ T Q
(in matrix form: Q and R stack the values Q(s, a) and r(s, a) over all state-action pairs, T is the transition matrix induced by the dynamics and the policy, and γ is the discount factor)
...and solve via least squares or linear program
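A sketch of Approach 2 with NumPy, under the assumption that the equations are stacked over state-action pairs so that T[(s,a), (s',a')] = p(s' | s, a) π(a' | s'); all names and shapes here are illustrative:

```python
import numpy as np

def solve_q_linear(P, R, pi, gamma=0.99):
    """Solve the Bellman equations Q = R + gamma * T Q as a linear system.
       P[s, a, s'] = transition probabilities, R[s, a] = rewards,
       pi[s, a]    = policy probabilities (all assumed, for illustration)."""
    S, A = R.shape
    # T[(s,a), (s',a')] = p(s' | s, a) * pi(a' | s')
    T = np.einsum("ijs,sk->ijsk", P, pi).reshape(S * A, S * A)
    # (I - gamma * T) Q = R  ->  solve by least squares
    Q_flat, *_ = np.linalg.lstsq(np.eye(S * A) - gamma * T, R.reshape(S * A), rcond=None)
    return Q_flat.reshape(S, A)
```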
Q-functions are Policies in Disguise
Observation: An optimal policy always chooses the action with the highest expected total reward (highest Q-value): π(s) = argmax_a Q(s, a)
Thus, we only need to learn the Q function, not the policy!
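For concreteness, reading the greedy policy off a tabular Q is a one-liner; the NumPy array of shape [num_states, num_actions] is an assumption for illustration:

```python
import numpy as np

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a), for a tabular Q of shape [num_states, num_actions]."""
    return np.argmax(Q, axis=1)   # one action index per state
```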
Our First RL Algorithm: Policy Iteration
Learn the Q function
Act Greedily
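Putting the two steps together, here is a rough policy-iteration sketch that reuses the evaluate_q sketch from above; the tabular inputs P and R and all other names are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, max_iters=1000):
    """Alternate policy evaluation and greedy policy improvement.
       P[s, a, s'] and R[s, a] are assumed tabular inputs; evaluate_q is the
       policy-evaluation sketch above (all names are illustrative)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)            # start from the uniform random policy
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        Q = evaluate_q(P, R, pi, gamma)      # policy evaluation: "how good is my policy?"
        greedy = np.argmax(Q, axis=1)        # policy improvement: act greedily w.r.t. Q
        pi_new = np.zeros((S, A))
        pi_new[np.arange(S), greedy] = 1.0
        if np.array_equal(pi_new, pi):       # stop when the greedy policy no longer changes
            break
        pi = pi_new
    return pi, Q
```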
Policy Evaluation
"How good is my policy?"
Policy Improvement
How do we know when we're done?
Answer: Use the Bellman Optimality Equations
"Total reward" = "reward now" + "future reward if you act greedily"
Notation: Q^π and V^π are the Q-function and value function of the current policy π, and π'(s) = argmax_a Q^π(s, a) is the greedy policy obtained from Q^π.
Does Policy Iteration Guarantee Improvement?
Might acting greedily result in a worse policy?
Observation: Acting greedily can only increase the Q-function: max_a Q^π(s, a) ≥ E_{a ~ π}[ Q^π(s, a) ] = V^π(s)
Claim: the expected reward (value) of the new policy is at least as large as the expected reward (value) of the old policy: V^{π'}(s) ≥ V^π(s) for all states s
Policy Improvement Theorem
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems