Dynamic Programming for Reinforcement Learning
Lecture 6
9/21/20
Administrivia
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems
Puzzle Solving in Groups of 3
When you're done, click "YES" in the Zoom window:
| | |
| | |
| | done |
In each cell, write the number of steps to reach the done state.
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | done | | |
In each cell, write the number of steps to reach the done state.
How would you write an algorithm to solve this puzzle?
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | done | | |
In each cell, draw an arrow indicating which direction to move to reach the done state fastest.
Then, write the number of steps to reach the goal in each state.
How would you write an algorithm to solve this puzzle?
| | |
| | |
| | done |
(OPTIONAL)
In this example, there are multiple arrows leaving each cell. The agent randomly chooses one of them. In each cell, write the expected number of steps to reach the done state.
STOP
Shortest Path Algorithms
Define d(s, a) = the shortest path distance to the done state from state s, if you start by taking action a.
Two properties of shortest path distances:
1. At the done state the distance is zero: d(done, a) = 0 for every action a.
2. Everywhere else, taking one step costs 1 and then you follow the best path from the next state: d(s, a) = 1 + min_{a'} d(s', a'), where s' is the state reached by taking action a in s.
| | |
| | |
| | done |
A Shortest Path Algorithm
Initialize d(s, a) ← 0 for all s, a
While not converged:
    For all s, a:
        d(s, a) ← 1 + min_{a'} d(s', a'), where s' is the state reached by taking action a in s (keeping d(done, a) = 0)
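As a concrete sketch, here is what this algorithm might look like in Python on an assumed 3x3 grid with the "done" cell in the bottom-right corner; the grid layout, action set, and all names are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: shortest-path dynamic programming on an assumed 3x3 grid.
ROWS, COLS = 3, 3
GOAL = (2, 2)                                   # the "done" cell (assumed location)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic grid move; bumping into a wall keeps you in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

states = [(r, c) for r in range(ROWS) for c in range(COLS)]

# Initialize d(s, a) = 0 for all s, a.
d = {(s, a): 0 for s in states for a in ACTIONS}

# While not converged: for all s, a, apply d(s, a) <- 1 + min_{a'} d(s', a').
changed = True
while changed:
    changed = False
    for s in states:
        if s == GOAL:
            continue                            # the done state keeps distance 0
        for a in ACTIONS:
            s_next = step(s, a)
            new = 1 + min(d[(s_next, a2)] for a2 in ACTIONS)
            if new != d[(s, a)]:
                d[(s, a)], changed = new, True

# Steps-to-go from each cell = distance under the best first action.
print({s: min(d[(s, a)] for a in ACTIONS) for s in states})
```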
Shortest Path == Reinforcement Learning
If the reward is -1 for every step (and 0 once you reach the done state), then maximizing the total reward is equivalent to minimizing the number of steps.
But what if we are using a different reward function?
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems
Side note: reward function design is really hard
Learning by Playing – Solving Sparse Reward Tasks from Scratch, Riedmiller et al 2018
Puzzle Solving in Groups of 3
When you're done, click "YES" in the Zoom window:
In each cell, write the expected future total reward.
| | |
| | |
| | done |
[Figure: the grid above with reward labels +2, +3, +1, -3, -5, +1]
| -1 | -1 |   0  |   0 | -1 |
|  0 |    |  -3  |     |  0 |
| -1 |    |  -2  |     | -1 |
|  0 | -1 |   0  | -10 |  0 |
|    |    |      |     | -1 |
| -1 | -2 | done |  -2 |  0 |
In each cell, draw an arrow indicating which direction to move to achieve the largest total reward.
Then, write the expected future total reward in each state.
How would you write an algorithm to solve this puzzle?
In each cell, write the expected future total reward.
How would you write an algorithm to solve this puzzle?
| | |
| | |
| | done |
[Figure: the grid above with reward labels -1, -3, -2, -2, -3, -5]
STOP
What is a Q-function?
Answer: Expected total future reward, starting from a given state and action (and then following the policy)
Intuition: Q functions capture information about how good each state and action are for achieving the agent's goal
What is a value function?
Answer: The expected Q-value under the policy's choice of action: V(s) = E_{a ~ π}[ Q(s, a) ].
Intuition: value functions capture how good each state is, averaging over the actions the policy would take.
Unrolling one step
"Total reward" = "reward now" + "reward later"
Learning the Q-function
Initialize Q(s, a) ← 0 for all states and actions
While not converged:
    For all states s and actions a:
        Q(s, a) ← r(s, a) + γ E_{s' ~ p(·|s, a)}[ V(s') ],  where V(s') = E_{a' ~ π}[ Q(s', a') ]
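A minimal Python sketch of this loop for a tabular problem; the array shapes, the names P, R, and pi, and the discount gamma are assumptions for illustration, not something specified in the slides:

```python
import numpy as np

def evaluate_q(P, R, pi, gamma=0.99, tol=1e-8):
    """Tabular policy evaluation (assumed shapes, for illustration):
       P[s, a, s'] = transition probabilities, R[s, a] = rewards,
       pi[s, a]    = probability the policy takes action a in state s."""
    S, A = R.shape
    Q = np.zeros((S, A))                       # Initialize
    while True:                                # While not converged
        V = (pi * Q).sum(axis=1)               # V(s) = E_{a ~ pi}[Q(s, a)]
        Q_new = R + gamma * P @ V              # Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```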
How do we know when we're done?
Answer: Use the Bellman equations: Q has converged when Q(s, a) = r(s, a) + γ E_{s'}[ V(s') ] holds for every state and action.
Another version, written entirely in terms of Q: Q(s, a) = r(s, a) + γ E_{s'}[ E_{a' ~ π}[ Q(s', a') ] ]
Alternative approach to learning Q(s, a)
Approach 1: Dynamic programming: iterate the Bellman update until Q stops changing.
Approach 2:
Observation: for a fixed policy, the Bellman equations are a set of linear equations, one per state-action pair:
Q(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) Σ_{a'} π(a' | s') Q(s', a')
Alternative approach to learning Q(s, a)
Approach 1: Dynamic programming: iterate the Bellman update until Q stops changing.
Approach 2: View the Bellman equations as a set of linear equations in the unknowns Q(s, a):
Q = R + γ T Q
(in matrix form: Q and R stack the values Q(s, a) and r(s, a) over all state-action pairs, T is the transition matrix induced by the dynamics and the policy, and γ is the discount factor)
...and solve via least squares or linear program
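A sketch of Approach 2 with NumPy, under the assumption that the equations are stacked over state-action pairs so that T[(s,a), (s',a')] = p(s' | s, a) π(a' | s'); all names and shapes here are illustrative:

```python
import numpy as np

def solve_q_linear(P, R, pi, gamma=0.99):
    """Solve the Bellman equations Q = R + gamma * T Q as a linear system.
       P[s, a, s'] = transition probabilities, R[s, a] = rewards,
       pi[s, a]    = policy probabilities (all assumed, for illustration)."""
    S, A = R.shape
    # T[(s,a), (s',a')] = p(s' | s, a) * pi(a' | s')
    T = np.einsum("ijs,sk->ijsk", P, pi).reshape(S * A, S * A)
    # (I - gamma * T) Q = R  ->  solve by least squares
    Q_flat, *_ = np.linalg.lstsq(np.eye(S * A) - gamma * T, R.reshape(S * A), rcond=None)
    return Q_flat.reshape(S, A)
```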
Q-functions are Policies in Disguise
Observation: An optimal policy always chooses the action with the highest expected total reward (highest Q-value): π(s) = argmax_a Q(s, a)
Thus, we only need to learn the Q function, not the policy!
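For concreteness, reading the greedy policy off a tabular Q is a one-liner; the NumPy array of shape [num_states, num_actions] is an assumption for illustration:

```python
import numpy as np

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a), for a tabular Q of shape [num_states, num_actions]."""
    return np.argmax(Q, axis=1)   # one action index per state
```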
Our First RL Algorithm: Policy Iteration
Learn the Q function
Act Greedily
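Putting the two steps together, here is a rough policy-iteration sketch that reuses the evaluate_q sketch from above; the tabular inputs P and R and all other names are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, max_iters=1000):
    """Alternate policy evaluation and greedy policy improvement.
       P[s, a, s'] and R[s, a] are assumed tabular inputs; evaluate_q is the
       policy-evaluation sketch above (all names are illustrative)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)            # start from the uniform random policy
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        Q = evaluate_q(P, R, pi, gamma)      # policy evaluation: "how good is my policy?"
        greedy = np.argmax(Q, axis=1)        # policy improvement: act greedily w.r.t. Q
        pi_new = np.zeros((S, A))
        pi_new[np.arange(S), greedy] = 1.0
        if np.array_equal(pi_new, pi):       # stop when the greedy policy no longer changes
            break
        pi = pi_new
    return pi, Q
```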
Policy Evaluation
"How good is my policy?"
Policy Improvement
How do we know when we're done?
Answer: Use the Bellman Optimality Equations
"Total reward" = "reward now" + "future reward if you act greedily"
Notation: Q^π and V^π are the Q-function and value function of the current policy π, and π'(s) = argmax_a Q^π(s, a) is the greedy policy obtained from Q^π.
Does Policy Iteration Guarantee Improvement?
Might acting greedily result in a worse policy?
Observation: Acting greedily can only increase the Q-function: max_a Q^π(s, a) ≥ E_{a ~ π}[ Q^π(s, a) ] = V^π(s)
Claim: the expected reward (value) of the new policy is at least as large as the expected reward (value) of the old policy: V^{π'}(s) ≥ V^π(s) for all states s
Policy Improvement Theorem
Plan for Today
Goal: Understand how to use dynamic programming to solve RL problems