1 of 41

10-403 Recitation 5

2 of 41

Monte Carlo (MC)


Update Rule:

V(St) ← average(Returns(St))

Incremental Update:

V(St) ← V(St) + α [ Gt − V(St) ]

where the return Gt is the sum of discounted rewards:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … + γ^(T−t−1) RT
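A minimal Python sketch of the incremental update, assuming a dict-based value table V and an episode stored as (state, reward) pairs; the names and the defaultdict setup are illustrative, not from the slides:

from collections import defaultdict

def mc_incremental_update(episode, V, alpha=0.1, gamma=0.99):
    """Every-visit incremental Monte Carlo update of state values.

    episode: list of (state, reward) pairs, reward received after leaving the state.
    V: dict mapping state -> current value estimate.
    """
    G = 0.0
    # Walk the episode backwards so each state's G is the discounted sum of what follows.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

# Usage: V = defaultdict(float); mc_incremental_update([("s0", 0.0), ("s1", 1.0)], V)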

DP (Value/Policy Iteration):

  • Iterate through all possibilities
    • What does this assume? Full knowledge of the environment
  • One-step bootstrap: biased estimate

Monte Carlo Learning:

  • Collect samples from episodes
  • Mean return: unbiased estimate (pro or con?)

3 of 41

DP vs MC


4 of 41

Monte Carlo (MC)


Q(St, At) ← average(Returns(St, At))

How would you modify the above to also generate a policy?

π(St) ← argmax_a Q(St, a)

Aggregate returns backwards through the episode
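A minimal sketch of first-visit MC control built from these two rules, assuming episodes are lists of (state, action, reward) tuples; the function and variable names are illustrative:

from collections import defaultdict

def mc_control_update(episode, Q, returns, gamma=0.99):
    """First-visit Monte Carlo control update for one finished episode.

    episode: list of (state, action, reward) tuples.
    Q: dict mapping (state, action) -> value estimate.
    returns: dict mapping (state, action) -> list of observed returns.
    """
    G = 0.0
    # Walk the episode backwards, aggregating the discounted return.
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = r + gamma * G
        # First-visit check: only record G at the earliest occurrence of (s, a).
        if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

def greedy_policy(Q, state, actions):
    """pi(s) <- argmax_a Q(s, a) over the available actions."""
    return max(actions, key=lambda a: Q[(state, a)])

# Usage: Q = defaultdict(float); returns = defaultdict(list)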

5 of 41

Temporal Difference Learning


+ Can learn before reaching a terminal state

+ Much more memory- and computation-efficient than MC

- Using a value estimate in the target introduces bias

New update rule (TD(0)):

V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]
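A one-function sketch of this update, assuming a dict-based value table and a done flag marking terminal next states; the names are illustrative:

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """TD(0) update: bootstrap from the current estimate of the next state.

    V: dict mapping state -> value estimate.
    done: True if s_next is terminal (its value is taken to be 0).
    """
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])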

6 of 41

Motivation for TD learning and N-step returns


[Figure: spectrum of update targets, from TD(0), which approximates the rest of the return with the estimate v, through n-step returns, which rely less on v, to the full MC estimate.]

7 of 41

N-step returns

The n-step return uses n real rewards, then bootstraps from the current value estimate:

Gt:t+n = Rt+1 + γ Rt+2 + … + γ^(n−1) Rt+n + γ^n V(St+n)

n = 1 recovers TD(0); letting n run to the end of the episode recovers the MC return.
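A small sketch of computing this quantity from a stored trajectory fragment, assuming the n observed rewards and the bootstrap state are passed in explicitly; the names are illustrative:

def n_step_return(rewards, V, s_boot, n, gamma=0.99):
    """Compute the n-step return G_{t:t+n}.

    rewards: the rewards R_{t+1}, ..., R_{t+n} actually observed.
    V: dict of state-value estimates used to bootstrap.
    s_boot: the state S_{t+n} reached after n steps (None if the episode ended sooner).
    """
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r
    if s_boot is not None:
        G += (gamma ** n) * V[s_boot]
    return G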


8 of 41

Q-learning: Off-policy TD Learning


  • Key benefit: off-policy!
  • Only requires (state, action, reward, next state) transitions drawn from the MDP
  • Doesn’t depend on the behavior policy anywhere!
  • Foundation for many sample-efficient RL methods

1-step Q-learning update:

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ max_a Q(St+1, a) − Q(St, At) ]
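A tabular sketch of this update on a single transition, assuming a dict-based Q-table and an iterable of actions available in the next state; the names are illustrative:

def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a single (s, a, r, s') transition.

    Q: dict mapping (state, action) -> value; actions: actions available in s_next.
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])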

9 of 41

Deep Q-learning


  • What happens if the state space and action space are too large?
    • Use function approximation to approximate the Q-values!
  • Use gradient descent to take a step towards minimizing the Bellman error:

Tabular: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Function approximation: take a gradient step on the Bellman error

L(θ) = ( target value − prediction )² = ( r + γ max_a' Qθ(s', a') − Qθ(s, a) )²
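A minimal PyTorch sketch of one Bellman-error loss on a minibatch, assuming the batch is a dict of tensors named s, a (int64), r, s_next, and done (float 0/1); these names, and the use of the same network for the target, are illustrative choices rather than anything from the slides:

import torch
import torch.nn as nn

def dqn_loss(q_net, batch, gamma=0.99):
    """Bellman-error loss on a minibatch, using the same network for the target
    (no target network yet; see the next slide).

    q_net: maps a state batch to a (batch, num_actions) tensor of Q-values.
    """
    q_pred = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(batch["s_next"]).max(dim=1).values
        target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next
    return nn.functional.mse_loss(q_pred, target)

# q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
# loss = dqn_loss(q_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()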

10 of 41

Target Networks


  • One problem with deep Q-learning: nonstationary targets
    • Updating the network weights changes the target value, which requires more updates
    • Unintended generalization to other states S’ can lead to error propagation
  • Solution: calculate target values with a network that’s updated every T gradient steps
    • Network has more time to fit targets accurately before they change
    • Slows down training, but not too many alternatives (recently: functional regularization)

Target value: r + γ max_a' Qθ⁻(s', a')  (computed with the frozen target network θ⁻)

Prediction: Qθ(s, a)  (computed with the online network θ)
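A small sketch of the target-network bookkeeping, assuming the PyTorch q_net from the previous sketch; sync_every plays the role of T, and both names are illustrative:

import copy

def make_target_network(q_net):
    """Create a frozen copy of the online network to compute targets with."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def maybe_sync_target(q_net, target_net, step, sync_every=1000):
    """Copy online weights into the target network every `sync_every` gradient steps.

    The target network would replace q_net when computing q_next in the loss above.
    """
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())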

11 of 41

Experience Replay


  • Problem #1: neural networks undergo catastrophic forgetting if they haven’t been trained on a (similar) sample recently
  • Problem #2: online samples tend to be highly correlated, which leads to unstable optimization
  • Solution: keep a large history of transitions in a “replay buffer,” then optimize the Bellman error on random minibatches drawn from it
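A minimal replay-buffer sketch; the capacity and field layout are illustrative assumptions:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the front

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform random minibatch; breaks temporal correlation between samples."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)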

12 of 41

Problem: Large State-Action Space


Trying to estimate the value at every state (solving the full MDP) is often infeasible

MC and TD still try to estimate Q/V value function for every state or state-action visited

  • Too much memory for tabular (10^48 states for chess)
  • An NN’s estimates may be unreliable at unseen states; “similar” states may have completely different values and optimal paths

13 of 41

Online Planning


  • Use internal model to simulate trajectories at current state, find the best one

Monte Carlo Tree Search (MCTS):

  • Only estimate value function for relevant part of state space
  • Consider only part of the full MDP at a given step

14 of 41

MCTS


node = state, edge = action

  • Tree: stores Q-values for only a subset of all state-action pairs
  • MC method: requires episode termination to update values

15 of 41

Selection


Given:

  • current state of agent = root node
  • Empty or existing tree with Q-values

Steps:

- Keep selecting actions via UCB until you reach the frontier of the tree (a state that is not yet a node)

“children” = actions; we don’t know all possible (s, a) → s' transitions

[Tree diagram: root state s with action edges a1 and a2 leading to child states s1 and s2.]
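A sketch of the selection rule, assuming each tree node stores per-action statistics as dicts with "Q" and "N" fields and that N(s) is taken as the sum of the child counts; the exploration constant c and the data layout are illustrative:

import math

def ucb_select(children, c=1.414):
    """Selection step: pick the action maximizing Q(s, a) + c * sqrt(ln N(s) / N(s, a)).

    children: dict mapping action -> {"Q": value estimate, "N": visit count} for one node.
    """
    total = sum(stats["N"] for stats in children.values()) or 1
    best_a, best_score = None, float("-inf")
    for a, stats in children.items():
        if stats["N"] == 0:
            return a  # always try an untried action first
        score = stats["Q"] + c * math.sqrt(math.log(total) / stats["N"])
        if score > best_score:
            best_a, best_score = a, score
    return best_a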

16 of 41

Expansion


Given:

- at a new state s not part of the tree

Steps:

  • Based on some rule, possibly add this new state to the tree

- ex: if depth of this state < max depth

  • Take random action a (since no Q-values available), receive reward r if available
  • G = Simulation(s, a)
  • Store Q(s, a) = gamma*G + r
  • return gamma*G + r to propagate return to parent node

Why not store all nodes and Q values?
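A sketch of the expansion step under the same dict-based tree, assuming a simulator env_model with step(s, a) -> (s_next, r, done) and actions(s) methods; rollout is the simulation helper sketched on the next slide, and all names here are illustrative:

import random

def expand(tree, s, env_model, gamma=0.99, max_depth=20, depth=0):
    """Expansion step: maybe add state s to the tree, act randomly, then roll out.

    tree: dict mapping state -> {action: {"Q": float, "N": int}}.
    """
    if depth < max_depth:
        tree.setdefault(s, {})                    # rule: only add if not too deep
    a = random.choice(env_model.actions(s))       # no Q-values yet, so act randomly
    s_next, r, done = env_model.step(s, a)
    G = 0.0 if done else rollout(s_next, env_model, gamma)   # G = Simulation(s, a)
    value = r + gamma * G
    if s in tree:
        tree[s][a] = {"Q": value, "N": 1}         # store Q(s, a)
    return value                                  # propagate the return to the parent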

17 of 41

Simulation


Given:

  • at a new state s not part of the tree

Steps:

  • If at a terminal state, return the reward
  • Use a very fast policy to determine the action a to take

- ex: random policy

  • G = Simulation(s, a)
  • return gamma*G + r

(Do not store a Q-value)
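A sketch of the simulation (rollout) step with a random fast policy, using the same assumed env_model interface (plus an is_terminal(s) check); as on the slide, nothing is stored along the way:

import random

def rollout(s, env_model, gamma=0.99):
    """Simulation step: play a fast (here random) policy to the end of the episode.

    Returns the discounted return from s. No Q-values are stored.
    """
    if env_model.is_terminal(s):
        return 0.0
    a = random.choice(env_model.actions(s))
    s_next, r, done = env_model.step(s, a)
    G = 0.0 if done else rollout(s_next, env_model, gamma)
    return r + gamma * G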

18 of 41

Backup


  • Propagate return from the recursive calls
  • Calculate return at each state
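A sketch of the backup step, assuming the selection path was recorded as (node statistics, reward) pairs from the root to the expanded node; the running-average update is the incremental MC mean:

def backup(path, G, gamma=0.99):
    """Backup step: propagate the return up the selection path.

    path: list of (stats, reward) pairs, where stats has "Q" and "N" for the
    (state, action) taken at that step. Walk it backwards, discounting as we go.
    """
    for stats, r in reversed(path):
        G = r + gamma * G
        stats["N"] += 1
        stats["Q"] += (G - stats["Q"]) / stats["N"]  # incremental mean of returns
    return G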

19 of 41

MCTS Overall


  • For the current state of the agent, repeatedly perform the previous steps until some criterion is met
    • ex: time limit
    • ex: Q-value convergence within some threshold
  • Execute the best action
  • Reuse the subtree of the successor state and repeat!
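A sketch of the outer loop, assuming a simulate() helper that runs one selection → expansion → rollout → backup pass from the root using the pieces above (simulate itself is not shown and is a placeholder name):

def mcts_plan(root_state, env_model, num_simulations=1000):
    """Run repeated MCTS simulations from the root, then pick the best action."""
    tree = {root_state: {}}
    for _ in range(num_simulations):      # stopping criterion: simulation budget
        simulate(tree, root_state, env_model)
    stats = tree[root_state]
    # Execute the action with the highest estimated value (visit count also works).
    return max(stats, key=lambda a: stats[a]["Q"])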


23 of 41

Monte Carlo Questions?


24 of 41

Topics (Lectures 1-8):

  1. Introduction to Reinforcement and Representation Learning
  2. Multi-Armed Bandits
  3. MDPs
  4. Value and Policy Iteration
  5. Monte Carlo Learning
  6. Temporal Difference Learning
  7. Monte Carlo Tree Search
  8. Deep Q-learning

***Note: this is not an exhaustive list. Anything covered in lecture is fair game.

25 of 41

What to expect?

  • Open note
    • Can only use downloaded content
    • No internet use
  • Types of questions:
    • True/False (with and without explanation)
    • Select all that apply (if asked to explain, explain for each choice why you did or did not select it)
    • Short answer
    • **If there is a box to explain, always explain your answer. Otherwise you will not get full credit.

26 of 41

Bandits

  • You have one state with k actions
  • Each action gets you a reward
  • Want to maximize reward in least amount of time
    • How to sample actions to do this efficiently?

Greedy action: At = argmax_a Qt(a)

27 of 41

Epsilon-greedy bandits
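A minimal sketch of the usual epsilon-greedy rule (explore uniformly with probability ε, otherwise act greedily); the dict-based Q and the default ε are illustrative:

import random

def epsilon_greedy_action(Q, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise exploit
    the greedy action argmax_a Q(a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[a])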

28 of 41

Upper confidence bound

At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]

  • The square-root term is a measure of the uncertainty or variance in the estimate of a’s value
  • As Nt(a) increases, the uncertainty term decreases
  • On the other hand, each time an action other than a is selected, t increases but Nt(a) does not, causing the uncertainty to increase
  • The use of the natural logarithm means that the increases get smaller over time, but are unbounded
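A sketch of UCB action selection matching the formula above, assuming dict-based estimates Q and counts N plus a total step counter t; the names and the value of c are illustrative:

import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions go first."""
    untried = [a for a in Q if N[a] == 0]
    if untried:
        return untried[0]
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))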

29 of 41

Optimistic Initial Values

  • Set initial Q values much higher than the reward
  • Encourage some exploration initially
  • Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving
  • The result is that all actions are tried several times before the value estimates converge, so the system does a fair amount of exploration even if greedy actions are selected all the time

30 of 41

Gradient Bandit Algorithms
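The update itself did not survive extraction; as a reminder, the gradient bandit algorithm keeps soft-max action preferences H and nudges them toward actions whose reward beats a running baseline. A sketch, where pull(a) is a hypothetical function that samples a reward for action a and avg_reward is the caller-maintained baseline:

import math
import random

def gradient_bandit_step(H, avg_reward, actions, pull, alpha=0.1):
    """One step of a gradient bandit: soft-max over preferences H, then
    H(a) += alpha * (R - avg_reward) * (1{a == A} - pi(a)) for every action a."""
    z = sum(math.exp(H[a]) for a in actions)
    pi = {a: math.exp(H[a]) / z for a in actions}   # soft-max action probabilities
    A = random.choices(actions, weights=[pi[a] for a in actions])[0]
    R = pull(A)
    for a in actions:
        H[a] += alpha * (R - avg_reward) * ((1.0 if a == A else 0.0) - pi[a])
    return A, R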

31 of 41

MDPs

State-value function: vπ(s) = Eπ[ Gt | St = s ]

Action-value function: qπ(s, a) = Eπ[ Gt | St = s, At = a ]

Policy: π(a|s) = probability of selecting action a in state s

* denotes optimal (v*, q*, π*)

32 of 41

Policy evaluation

  • Find the value function for a given policy
  • Converges to the unique true value function in the limit
  • In practice, use iterative policy evaluation (below); stop when the max delta falls below a threshold
  • Can update the value function “in place” or use two copies

vk+1(s) ← Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ vk(s') ]
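A sketch of iterative policy evaluation with in-place updates, assuming an explicit model transitions[(s, a)] given as (prob, next state, reward, done) tuples; that model format and the threshold theta are illustrative:

def policy_evaluation(states, actions, policy, transitions, gamma=0.99, theta=1e-6):
    """Iterative policy evaluation, updating V in place.

    policy[s][a]: probability of action a in state s.
    transitions[(s, a)]: list of (prob, s_next, reward, done) tuples.
    Stops when the maximum update delta drops below theta.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for p, s_next, r, done in transitions[(s, a)]:
                    v_new += policy[s][a] * p * (r + (0.0 if done else gamma * V[s_next]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V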

33 of 41

Bellman Backup & Contraction Mapping Theorem

  • The value function vπ is the unique solution to its Bellman equation
  • The Bellman backup operator T is a γ-contraction in the max norm: ‖Tv − Tv'‖∞ ≤ γ ‖v − v'‖∞, so repeated backups converge to the unique fixed point

34 of 41

Policy improvement

  • Given the value function for the current policy, do a one-step look-ahead and check whether it is better to switch the policy to a new action in each state: π'(s) ← argmax_a qπ(s, a)
  • Strictly improving except when the policy is already optimal

35 of 41

Policy iteration

36 of 41

Value iteration

  • Effectively combines one sweep of policy evaluation and one sweep of policy improvement in each state sweep
  • Update: vk+1(s) ← max_a Σs',r p(s', r | s, a) [ r + γ vk(s') ]
  • Often converges much faster than policy iteration
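A sketch of value iteration over the same assumed model format as in the policy-evaluation sketch above:

def value_iteration(states, actions, transitions, gamma=0.99, theta=1e-6):
    """Value iteration: back up with a max over actions in every sweep.

    transitions[(s, a)]: list of (prob, s_next, reward, done) tuples.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = []
            for a in actions:
                q = sum(p * (r + (0.0 if done else gamma * V[s_next]))
                        for p, s_next, r, done in transitions[(s, a)])
                q_values.append(q)
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            # A greedy policy pi(s) = argmax_a q(s, a) can then be read off from V.
            return V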

37 of 41

Monte Carlo

  • We do not assume complete knowledge of the environment.
  • Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Average returns observed after visits to a state
  • Does not depend on estimates of other states (no bootstrapping)
  • Get rid of exploring starts:
    • on-policy (e.g. epsilon-greedy)
    • off-policy (e.g. importance sampling)

38 of 41

Temporal Difference

  • We do not assume complete knowledge of the environment.
  • TD methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Bootstrap from estimates of other states
  • Monte Carlo needs to wait until the end of an episode, while TD(0) methods only need to wait one step
  • On-policy (e.g. SARSA), off-policy (e.g. Q-learning)

39 of 41

Monte Carlo vs Temporal Difference

40 of 41

MCTS & DQN

(see other slides)

41 of 41

Questions?