1 of 41

10-403 Recitation 5

2 of 41

Monte Carlo (MC)


Update Rule:

V(St) ← average(Returns(St))

Incremental Update:

V(St) ← V(St) + α [ Gt − V(St) ]

where the return Gt is the sum of discounted rewards:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … + γ^(T−t−1) RT
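A minimal Python sketch of the incremental update, assuming a dict-based value table V and an episode stored as (state, reward) pairs; the names and the defaultdict setup are illustrative, not from the slides:

from collections import defaultdict

def mc_incremental_update(episode, V, alpha=0.1, gamma=0.99):
    """Every-visit incremental Monte Carlo update of state values.

    episode: list of (state, reward) pairs, reward received after leaving the state.
    V: dict mapping state -> current value estimate.
    """
    G = 0.0
    # Walk the episode backwards so each state's G is the discounted sum of what follows.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

# Usage: V = defaultdict(float); mc_incremental_update([("s0", 0.0), ("s1", 1.0)], V)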

DP (Value/Policy Iteration):

  • Iterate through all possibilities
    • What does this assume? Full knowledge of the environment
  • One-step bootstrap: biased estimate

Monte Carlo Learning:

  • Collect samples from episodes
  • Mean return: unbiased estimate (pro or con?)

3 of 41

DP vs MC


4 of 41

Monte Carlo (MC)


Q(St, At) ← average(Returns(St, At))

How would you modify the above to also generate a policy?

π(St) ← argmax_a Q(St, a)

Aggregate returns backwards through the episode
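A minimal sketch of first-visit MC control built from these two rules, assuming episodes are lists of (state, action, reward) tuples; the function and variable names are illustrative:

from collections import defaultdict

def mc_control_update(episode, Q, returns, gamma=0.99):
    """First-visit Monte Carlo control update for one finished episode.

    episode: list of (state, action, reward) tuples.
    Q: dict mapping (state, action) -> value estimate.
    returns: dict mapping (state, action) -> list of observed returns.
    """
    G = 0.0
    # Walk the episode backwards, aggregating the discounted return.
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = r + gamma * G
        # First-visit check: only record G at the earliest occurrence of (s, a).
        if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

def greedy_policy(Q, state, actions):
    """pi(s) <- argmax_a Q(s, a) over the available actions."""
    return max(actions, key=lambda a: Q[(state, a)])

# Usage: Q = defaultdict(float); returns = defaultdict(list)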

5 of 41

Temporal Difference Learning


+ Can learn before reaching a terminal state

+ Much more memory- and computation-efficient than MC

- Using a value estimate in the target introduces bias

New update rule (TD(0)):

V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]
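A one-function sketch of this update, assuming a dict-based value table and a done flag marking terminal next states; the names are illustrative:

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """TD(0) update: bootstrap from the current estimate of the next state.

    V: dict mapping state -> value estimate.
    done: True if s_next is terminal (its value is taken to be 0).
    """
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])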

6 of 41

Motivation for TD learning and N-step returns


[Figure: spectrum of update targets, from TD(0), which approximates the rest of the return with the estimate v, through n-step returns, which rely less on v, to the full MC estimate.]

7 of 41

N-step returns

The n-step return uses n real rewards, then bootstraps from the current value estimate:

Gt:t+n = Rt+1 + γ Rt+2 + … + γ^(n−1) Rt+n + γ^n V(St+n)

n = 1 recovers TD(0); letting n run to the end of the episode recovers the MC return.
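A small sketch of computing this quantity from a stored trajectory fragment, assuming the n observed rewards and the bootstrap state are passed in explicitly; the names are illustrative:

def n_step_return(rewards, V, s_boot, n, gamma=0.99):
    """Compute the n-step return G_{t:t+n}.

    rewards: the rewards R_{t+1}, ..., R_{t+n} actually observed.
    V: dict of state-value estimates used to bootstrap.
    s_boot: the state S_{t+n} reached after n steps (None if the episode ended sooner).
    """
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r
    if s_boot is not None:
        G += (gamma ** n) * V[s_boot]
    return G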


8 of 41

Q-learning: Off-policy TD Learning


  • Key benefit: off-policy!
  • Only requires (state, action, reward, next state) transitions drawn from the MDP
  • Doesn’t depend on the behavior policy anywhere!
  • Foundation for many sample-efficient RL methods

1-step Q-learning update:

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ max_a Q(St+1, a) − Q(St, At) ]
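A tabular sketch of this update on a single transition, assuming a dict-based Q-table and an iterable of actions available in the next state; the names are illustrative:

def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a single (s, a, r, s') transition.

    Q: dict mapping (state, action) -> value; actions: actions available in s_next.
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])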

9 of 41

Deep Q-learning


  • What happens if the state space and action space are too large?
    • Use function approximation to approximate the Q-values!
  • Use gradient descent to take a step towards minimizing the Bellman error:

Tabular: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Function approximation: take a gradient step on the Bellman error

L(θ) = ( target value − prediction )² = ( r + γ max_a' Qθ(s', a') − Qθ(s, a) )²
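A minimal PyTorch sketch of one Bellman-error loss on a minibatch, assuming the batch is a dict of tensors named s, a (int64), r, s_next, and done (float 0/1); these names, and the use of the same network for the target, are illustrative choices rather than anything from the slides:

import torch
import torch.nn as nn

def dqn_loss(q_net, batch, gamma=0.99):
    """Bellman-error loss on a minibatch, using the same network for the target
    (no target network yet; see the next slide).

    q_net: maps a state batch to a (batch, num_actions) tensor of Q-values.
    """
    q_pred = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(batch["s_next"]).max(dim=1).values
        target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next
    return nn.functional.mse_loss(q_pred, target)

# q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
# loss = dqn_loss(q_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()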

10 of 41

Target Networks


  • One problem with deep Q-learning: nonstationary targets
    • Updating the network weights changes the target value, which requires more updates
    • Unintended generalization to other states S’ can lead to error propagation
  • Solution: calculate target values with a network that’s updated every T gradient steps
    • Network has more time to fit targets accurately before they change
    • Slows down training, but not too many alternatives (recently: functional regularization)

Target value: r + γ max_a' Qθ⁻(s', a')  (computed with the frozen target network θ⁻)

Prediction: Qθ(s, a)  (computed with the online network θ)
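A small sketch of the target-network bookkeeping, assuming the PyTorch q_net from the previous sketch; sync_every plays the role of T, and both names are illustrative:

import copy

def make_target_network(q_net):
    """Create a frozen copy of the online network to compute targets with."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def maybe_sync_target(q_net, target_net, step, sync_every=1000):
    """Copy online weights into the target network every `sync_every` gradient steps.

    The target network would replace q_net when computing q_next in the loss above.
    """
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())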

11 of 41

Experience Replay


  • Problem #1: neural networks undergo catastrophic forgetting if they haven’t been trained on a (similar) sample recently
  • Problem #2: online samples tend to be highly correlated, which leads to unstable optimization
  • Solution: keep a large history of transitions in a “replay buffer,” then optimize the Bellman error on random minibatches drawn from it
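A minimal replay-buffer sketch; the capacity and field layout are illustrative assumptions:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the front

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform random minibatch; breaks temporal correlation between samples."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)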

12 of 41

Problem: Large State-Action Space


Trying to estimate the value at every state (solving the full MDP) is often infeasible

MC and TD still try to estimate Q/V value function for every state or state-action visited

  • Too much memory for tabular (10^48 states for chess)
  • An NN’s estimates may be unreliable at unseen states; “similar” states may have completely different values and optimal paths

13 of 41

Online Planning


  • Use internal model to simulate trajectories at current state, find the best one

Monte Carlo Tree Search (MCTS):

  • Only estimate value function for relevant part of state space
  • Consider only part of the full MDP at a given step

14 of 41

MCTS


node = state, edge = action

  • Tree: stores Q-values for only a subset of all state-action pairs
  • MC method: requires episode termination to update values

15 of 41

Selection


Given:

  • current state of agent = root node
  • Empty or existing tree with Q-values

Steps:

- Keep selecting actions via UCB until you reach the frontier of the tree (a state that is not yet a node)

“children” = actions; we don’t know all possible (s, a) → s' transitions

[Tree diagram: root state s with action edges a1 and a2 leading to child states s1 and s2.]
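A sketch of the selection rule, assuming each tree node stores per-action statistics as dicts with "Q" and "N" fields and that N(s) is taken as the sum of the child counts; the exploration constant c and the data layout are illustrative:

import math

def ucb_select(children, c=1.414):
    """Selection step: pick the action maximizing Q(s, a) + c * sqrt(ln N(s) / N(s, a)).

    children: dict mapping action -> {"Q": value estimate, "N": visit count} for one node.
    """
    total = sum(stats["N"] for stats in children.values()) or 1
    best_a, best_score = None, float("-inf")
    for a, stats in children.items():
        if stats["N"] == 0:
            return a  # always try an untried action first
        score = stats["Q"] + c * math.sqrt(math.log(total) / stats["N"])
        if score > best_score:
            best_a, best_score = a, score
    return best_a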

16 of 41

Expansion


Given:

- at a new state s not part of the tree

Steps:

  • Based on some rule, possibly add this new state to the tree

- ex: if depth of this state < max depth

  • Take random action a (since no Q-values available), receive reward r if available
  • G = Simulation(s, a)
  • Store Q(s, a) = gamma*G + r
  • return gamma*G + r to propagate return to parent node

Why not store all nodes and Q values?
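A sketch of the expansion step under the same dict-based tree, assuming a simulator env_model with step(s, a) -> (s_next, r, done) and actions(s) methods; rollout is the simulation helper sketched on the next slide, and all names here are illustrative:

import random

def expand(tree, s, env_model, gamma=0.99, max_depth=20, depth=0):
    """Expansion step: maybe add state s to the tree, act randomly, then roll out.

    tree: dict mapping state -> {action: {"Q": float, "N": int}}.
    """
    if depth < max_depth:
        tree.setdefault(s, {})                    # rule: only add if not too deep
    a = random.choice(env_model.actions(s))       # no Q-values yet, so act randomly
    s_next, r, done = env_model.step(s, a)
    G = 0.0 if done else rollout(s_next, env_model, gamma)   # G = Simulation(s, a)
    value = r + gamma * G
    if s in tree:
        tree[s][a] = {"Q": value, "N": 1}         # store Q(s, a)
    return value                                  # propagate the return to the parent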

17 of 41

Simulation


Given:

  • at a new state s not part of the tree

Steps:

  • If at a terminal state, return the reward
  • Use a very fast policy to determine the action a to take

- ex: random policy

  • G = Simulation(s, a)
  • return gamma*G + r

(Do not store a Q-value)
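A sketch of the simulation (rollout) step with a random fast policy, using the same assumed env_model interface (plus an is_terminal(s) check); as on the slide, nothing is stored along the way:

import random

def rollout(s, env_model, gamma=0.99):
    """Simulation step: play a fast (here random) policy to the end of the episode.

    Returns the discounted return from s. No Q-values are stored.
    """
    if env_model.is_terminal(s):
        return 0.0
    a = random.choice(env_model.actions(s))
    s_next, r, done = env_model.step(s, a)
    G = 0.0 if done else rollout(s_next, env_model, gamma)
    return r + gamma * G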

18 of 41

Backup


  • Propagate return from the recursive calls
  • Calculate return at each state
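A sketch of the backup step, assuming the selection path was recorded as (node statistics, reward) pairs from the root to the expanded node; the running-average update is the incremental MC mean:

def backup(path, G, gamma=0.99):
    """Backup step: propagate the return up the selection path.

    path: list of (stats, reward) pairs, where stats has "Q" and "N" for the
    (state, action) taken at that step. Walk it backwards, discounting as we go.
    """
    for stats, r in reversed(path):
        G = r + gamma * G
        stats["N"] += 1
        stats["Q"] += (G - stats["Q"]) / stats["N"]  # incremental mean of returns
    return G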

19 of 41

MCTS Overall


  • For the current state of the agent, repeatedly perform the previous steps until some criterion is met
    • ex: time limit
    • ex: Q-value convergence within some threshold
  • Execute the best action
  • Reuse the subtree of the successor state and repeat!
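A sketch of the outer loop, assuming a simulate() helper that runs one selection → expansion → rollout → backup pass from the root using the pieces above (simulate itself is not shown and is a placeholder name):

def mcts_plan(root_state, env_model, num_simulations=1000):
    """Run repeated MCTS simulations from the root, then pick the best action."""
    tree = {root_state: {}}
    for _ in range(num_simulations):      # stopping criterion: simulation budget
        simulate(tree, root_state, env_model)
    stats = tree[root_state]
    # Execute the action with the highest estimated value (visit count also works).
    return max(stats, key=lambda a: stats[a]["Q"])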


23 of 41

Monte Carlo Questions?


24 of 41

Topics (Lectures 1-8):

  1. Introduction to Reinforcement and Representation Learning
  2. Multi-Armed Bandits
  3. MDPs
  4. Value and Policy Iteration
  5. Monte Carlo Learning
  6. Temporal Difference Learning
  7. Monte Carlo Tree Search
  8. Deep Q-learning

***Note: this is not an exhaustive list. Anything covered in lecture is fair game.

25 of 41

What to expect?

  • Open note
    • Can only use downloaded content
    • No internet use
  • Types of questions:
    • True/False (with and without explanation)
    • Select all that apply (if asked to explain, explain for each choice why you did or did not select it)
    • Short answer
    • **If there is a box to explain, always explain your answer. Otherwise you will not get full credit.

26 of 41

Bandits

  • You have one state with k actions
  • Each action gets you a reward
  • Want to maximize reward in least amount of time
    • How to sample actions to do this efficiently?

Greedy action: At = argmax_a Qt(a)

27 of 41

Epsilon-greedy bandits
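A minimal sketch of the usual epsilon-greedy rule (explore uniformly with probability ε, otherwise act greedily); the dict-based Q and the default ε are illustrative:

import random

def epsilon_greedy_action(Q, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise exploit
    the greedy action argmax_a Q(a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[a])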

28 of 41

Upper confidence bound

At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]

  • The square-root term is a measure of the uncertainty or variance in the estimate of a’s value
  • As Nt(a) increases, the uncertainty term decreases
  • On the other hand, each time an action other than a is selected, t increases but Nt(a) does not, causing the uncertainty to increase
  • The use of the natural logarithm means that the increases get smaller over time, but are unbounded
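A sketch of UCB action selection matching the formula above, assuming dict-based estimates Q and counts N plus a total step counter t; the names and the value of c are illustrative:

import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions go first."""
    untried = [a for a in Q if N[a] == 0]
    if untried:
        return untried[0]
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))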

29 of 41

Optimistic Initial Values

  • Set initial Q values much higher than the reward
  • Encourage some exploration initially
  • Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving
  • The result is that all actions are tried several times before the value estimates converge, so the system does a fair amount of exploration even if greedy actions are selected all the time

30 of 41

Gradient Bandit Algorithms
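The update itself did not survive extraction; as a reminder, the gradient bandit algorithm keeps soft-max action preferences H and nudges them toward actions whose reward beats a running baseline. A sketch, where pull(a) is a hypothetical function that samples a reward for action a and avg_reward is the caller-maintained baseline:

import math
import random

def gradient_bandit_step(H, avg_reward, actions, pull, alpha=0.1):
    """One step of a gradient bandit: soft-max over preferences H, then
    H(a) += alpha * (R - avg_reward) * (1{a == A} - pi(a)) for every action a."""
    z = sum(math.exp(H[a]) for a in actions)
    pi = {a: math.exp(H[a]) / z for a in actions}   # soft-max action probabilities
    A = random.choices(actions, weights=[pi[a] for a in actions])[0]
    R = pull(A)
    for a in actions:
        H[a] += alpha * (R - avg_reward) * ((1.0 if a == A else 0.0) - pi[a])
    return A, R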

31 of 41

MDPs

State-value function: vπ(s) = Eπ[ Gt | St = s ]

Action-value function: qπ(s, a) = Eπ[ Gt | St = s, At = a ]

Policy: π(a|s) = probability of selecting action a in state s

* denotes optimal (v*, q*, π*)

32 of 41

Policy evaluation

  • Find the value function for a given policy
  • Converges to the unique true value function in the limit
  • In practice, use iterative policy evaluation (below); stop when the max delta falls below a threshold
  • Can update the value function “in place” or use two copies

vk+1(s) ← Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ vk(s') ]
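A sketch of iterative policy evaluation with in-place updates, assuming an explicit model transitions[(s, a)] given as (prob, next state, reward, done) tuples; that model format and the threshold theta are illustrative:

def policy_evaluation(states, actions, policy, transitions, gamma=0.99, theta=1e-6):
    """Iterative policy evaluation, updating V in place.

    policy[s][a]: probability of action a in state s.
    transitions[(s, a)]: list of (prob, s_next, reward, done) tuples.
    Stops when the maximum update delta drops below theta.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for p, s_next, r, done in transitions[(s, a)]:
                    v_new += policy[s][a] * p * (r + (0.0 if done else gamma * V[s_next]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V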

33 of 41

Bellman Backup & Contraction Mapping Theorem

  • The value function vπ is the unique solution to its Bellman equation
  • The Bellman backup operator T is a γ-contraction in the max norm: ‖Tv − Tv'‖∞ ≤ γ ‖v − v'‖∞, so repeated backups converge to the unique fixed point

34 of 41

Policy improvement

  • Given the value function for the current policy, do a one-step look-ahead and check whether it is better to switch the policy to a new action in each state: π'(s) ← argmax_a qπ(s, a)
  • Strictly improving except when the policy is already optimal

35 of 41

Policy iteration

36 of 41

Value iteration

  • Effectively combines one sweep of policy evaluation and one sweep of policy improvement in each state sweep
  • Update: vk+1(s) ← max_a Σs',r p(s', r | s, a) [ r + γ vk(s') ]
  • Often converges much faster than policy iteration
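A sketch of value iteration over the same assumed model format as in the policy-evaluation sketch above:

def value_iteration(states, actions, transitions, gamma=0.99, theta=1e-6):
    """Value iteration: back up with a max over actions in every sweep.

    transitions[(s, a)]: list of (prob, s_next, reward, done) tuples.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = []
            for a in actions:
                q = sum(p * (r + (0.0 if done else gamma * V[s_next]))
                        for p, s_next, r, done in transitions[(s, a)])
                q_values.append(q)
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            # A greedy policy pi(s) = argmax_a q(s, a) can then be read off from V.
            return V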

37 of 41

Monte Carlo

  • We do not assume complete knowledge of the environment.
  • Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Average returns observed after visits to a state
  • Does not depend on estimates of other states (no bootstrapping)
  • Get rid of exploring starts:
    • on-policy (e.g. epsilon-greedy)
    • off-policy (e.g. importance sampling)

38 of 41

Temporal Difference

  • We do not assume complete knowledge of the environment.
  • TD methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Bootstrap from estimates of other states
  • Monte Carlo needs to wait until the end of an episode, while TD(0) methods only need to wait one step
  • On-policy (e.g. SARSA), off-policy (e.g. Q-learning)

39 of 41

Monte Carlo vs Temporal Difference

40 of 41

MCTS & DQN

(see other slides)

41 of 41

Questions?