10-403 Recitation 5
Monte Carlo (MC)
Update Rule:
Incremental Update:
where return is the sum of discounted rewards:
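A sketch of the standard forms these refer to (assuming the usual incremental update with step size α, discount γ, and episode ending at time T):

Q(S_t, A_t) ← Q(S_t, A_t) + α [ G_t − Q(S_t, A_t) ]
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … + γ^{T−t−1} R_T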
DP (Value/Policy Iteration):
Monte Carlo Learning:
+ Collect samples from episodes
What does this assume?
- mean return: unbiased
Pro or con?
DP vs MC
Monte Carlo (MC)
Q(S_t, A_t) ← average(Returns(S_t, A_t))
How would you modify the above to also generate a policy?
π(S_t) ← argmax_a Q(S_t, a)
Aggregate backwards
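A minimal sketch of this every-visit MC control loop in Python (the env interface, num_actions attribute, and ε-greedy exploration are assumptions for illustration, not from the slides):

import random
from collections import defaultdict

def mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)   # Q[(s, a)]: running mean of observed returns
    N = defaultdict(int)     # N[(s, a)]: how many returns have been averaged

    def act(s):
        # epsilon-greedy w.r.t. the current Q (the exploration scheme is an assumption)
        if random.random() < epsilon:
            return random.randrange(env.num_actions)
        return max(range(env.num_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        # MC must wait for the end of an episode before updating
        episode, s, done = [], env.reset(), False
        while not done:
            a = act(s)
            s_next, r, done = env.step(a)   # assumed env interface: (next_state, reward, done)
            episode.append((s, a, r))
            s = s_next

        # aggregate backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental average

    return Q   # greedy policy: pi(s) = argmax_a Q[(s, a)]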
Temporal Difference Learning
+ Can learn before reaching a terminal state
+ Much more memory- and computation-efficient than MC
- Using value in the target introduces bias
New update rule:
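A sketch of the standard TD(0) form of this update (step size α, discount γ):

V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]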
Motivation for TD learning and N-step returns
[Figure: spectrum of return targets, from the full MC estimate, through N-step returns (less reliance on v), to TD(0), which approximates the remaining return with v]
N-step returns
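A sketch of the n-step return target, which bootstraps with v after n steps (notation as above):

G_t^{(n)} = R_{t+1} + γ R_{t+2} + … + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})

With n = 1 this is the TD(0) target; as n grows toward the episode length it approaches the full MC return.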
Q-learning: Off-policy TD Learning
1-step Q-learning update:
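The standard form (a sketch; step size α, discount γ):

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]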
Deep Q-learning
Tabular
Function Approximation
Target value
Prediction
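A sketch of how target value and prediction pair up in each case (θ denotes the Q-network parameters; the squared-error loss is the usual DQN choice, assumed here):

Tabular: target = R_{t+1} + γ max_a Q(S_{t+1}, a), prediction = Q(S_t, A_t)
Function approximation: L(θ) = ( R_{t+1} + γ max_a Q(S_{t+1}, a; θ) − Q(S_t, A_t; θ) )²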
Target Networks
Target value
Prediction
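With a target network, the target is computed with a separate, slowly updated copy θ⁻ of the parameters (a sketch of the usual form):

L(θ) = ( R_{t+1} + γ max_a Q(S_{t+1}, a; θ⁻) − Q(S_t, A_t; θ) )²

θ⁻ is periodically copied from θ (or updated as a slow moving average), which keeps the target from chasing the prediction.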
Experience Replay
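A minimal sketch of a replay buffer (the capacity and batch size shown are arbitrary assumptions):

import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and samples minibatches for training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the temporal correlation of consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones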
Problem: Large State-Action Space
Trying to estimate the value at every state (solving the full MDP) is often infeasible
MC and TD still try to estimate the Q/V value function for every state or state-action pair visited
Online Planning
Monte Carlo Tree Search (MCTS):
MCTS
node = state, edge = action
Selection
Given:
Steps:
- keep applying UCB repeatedly until you reach the frontier of the tree (a state that is not yet a node)
“children” = actions, don’t know all possible (s,a) → s’ transitions
[Tree diagram: root state s with action edges a1 and a2 leading to successor states s1 and s2]
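Selection typically scores a node's children with a UCB-style rule (a sketch of the UCT form; c is an exploration constant, N(s) the node's visit count, N(s, a) the edge's visit count):

a* = argmax_a [ Q(s, a) + c √( ln N(s) / N(s, a) ) ]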
Expansion
Given:
- at a new state s not part of the tree
Steps:
- ex: if depth of this state < max depth
Why not store all nodes and Q values?
Simulation
Given:
Steps:
- ex: random policy
(Do Not store Q-value)
Backup
MCTS Overall
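A minimal Python sketch of how the four phases fit together; the node and env_model interfaces used here (fully_expanded, select_ucb, expand, rollout, best_action) are hypothetical placeholders for illustration:

def mcts(root, env_model, num_iterations, max_depth, rollout_policy):
    """Online planning from the current state: run simulations, then act greedily at the root.
    The node interface is hypothetical, sketched only to show how the phases connect."""
    for _ in range(num_iterations):
        node, path = root, [root]

        # 1) Selection: descend with UCB until reaching the frontier of the tree
        while node.fully_expanded() and node.depth < max_depth:
            action, node = node.select_ucb()
            path.append(node)

        # 2) Expansion: add the new state as a node (e.g. only if its depth < max depth)
        if node.depth < max_depth:
            node = node.expand(env_model)
            path.append(node)

        # 3) Simulation: roll out a cheap default policy (e.g. random); no Q-values stored here
        value = node.rollout(env_model, rollout_policy)

        # 4) Backup: propagate the return up the visited path, updating counts and running means
        for n in reversed(path):
            n.visits += 1
            n.value += (value - n.value) / n.visits

    return root.best_action()   # act using the statistics accumulated at the root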
Monte Carlo Questions?
Topics (Lectures 1-8):
***Note this is not an exhaustive list. Anything covered in lectures is fair game.
What to expect?
Bandits
Greedy action:
Epsilon-greedy bandits
Upper confidence bound
Optimistic Initial Values
Gradient Bandit Algorithms
MDPs
State-value function:
Action-value function:
policy:
* denotes optimal
Policy evaluation
Bellman Backup & Contraction Mapping Theorem
Policy improvement
Policy iteration
Value iteration
Monte Carlo
Temporal Difference
Monte Carlo vs Temporal Difference
MCTS & DQN
(see other slides)
Questions?