1 of 69

10-703 Recitation 5

2 of 69

Topics (Lectures 1-8):

  1. Introduction to Reinforcement and Representation Learning
  2. Multi-Arm Bandits
  3. MDPs, Value and Policy Iteration
  4. Monte Carlo Learning, Temporal Difference Learning, Monte Carlo Tree Search
  5. Function Approximation, Deep Q learning
  6. Policy gradients, REINFORCE, Actor-Critic methods

***Note: this is not an exhaustive list. Anything covered in lecture is fair game.

3 of 69

What to expect?

  • Open note
    • Can only use downloaded content
    • No internet use
  • Types of questions:
    • True/False (with and without explanation)
    • Select all that apply (if asked to explain, explain for each choice why you did or did not select it)
    • Short answer
    • **If there is a box to explain, always explain your answer. Otherwise you will not get full credit.

4 of 69

Bandits

  • You have one state with k actions
  • Each action gets you a reward
  • Want to maximize total reward in the least amount of time
    • How to sample actions to do this efficiently?

Greedy action:

5 of 69

Epsilon-greedy bandits
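The action-selection and update rules for these bandit slides appear as images in the original. As a stand-in, here is a minimal NumPy sketch of greedy and epsilon-greedy selection with incremental sample-average Q updates (function names, epsilon, and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_action(Q):
    """Greedy action: pick the arm with the highest current value estimate."""
    return int(np.argmax(Q))

def epsilon_greedy_action(Q, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return greedy_action(Q)

def update_estimate(Q, N, a, reward):
    """Incremental sample-average update: Q[a] <- Q[a] + (r - Q[a]) / N[a]."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```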

6 of 69

Upper confidence bound

  • The square-root term is a measure of the uncertainty or variance in the estimate of a's value
  • As N_t(a) increases, the uncertainty term decreases
  • On the other hand, each time an action other than a is selected, t increases but N_t(a) does not, causing the uncertainty to increase
  • The use of the natural logarithm means that the increases get smaller over time, but are unbounded
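A minimal sketch of UCB action selection consistent with the bullets above, assuming Q and N are arrays of value estimates and visit counts and c is an exploration constant (all names illustrative):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a Q[a] + c * sqrt(ln t / N[a]); try untried actions first."""
    if np.any(N == 0):
        return int(np.argmin(N))            # every action sampled at least once
    bonus = c * np.sqrt(np.log(t) / N)      # uncertainty term: shrinks as N[a] grows
    return int(np.argmax(Q + bonus))
```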

7 of 69

Optimistic Initial Values

  • Set initial Q-values much higher than any reward you expect to see
  • Encourages some exploration initially
  • Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.
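A tiny illustrative sketch of the idea (the number of arms and the initial value of 5.0 are arbitrary):

```python
import numpy as np

k = 10                        # number of arms (illustrative)
Q = np.full(k, 5.0)           # optimistic: well above any realistic reward in this toy setup
N = np.zeros(k)
# Acting purely greedily w.r.t. Q still explores at first: every arm's first real
# reward is "disappointing" compared to 5.0, so its estimate drops and a different
# arm becomes greedy, until all arms have been tried several times.
```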

8 of 69

Bandits

  • Boltzmann Exploration
    • Uses a Boltzmann distribution to pick actions, very similar to a softmax function
    • By using a probability distribution that is updated, it balances exploration and exploitation

Action selection for Boltzmann
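The selection rule itself is an image in the original slide; here is a minimal sketch of Boltzmann (softmax-with-temperature) action selection, where tau is an assumed temperature parameter:

```python
import numpy as np

def boltzmann_action(Q, tau=1.0, rng=np.random.default_rng(0)):
    """Sample an action with probability proportional to exp(Q[a] / tau).

    Small tau -> nearly greedy (exploitation); large tau -> nearly uniform (exploration).
    """
    prefs = (Q - Q.max()) / tau             # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(Q), p=probs))
```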

9 of 69

MDPs

State-value function:

Action-value function:

Policy:

* denotes optimal
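The definitions on this slide are images in the original; for reference, the standard forms in Sutton & Barto notation are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}                          % return
v_\pi(s)   = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]            % state-value function
q_\pi(s,a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]   % action-value function
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)                             % (stochastic) policy
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)     % * denotes optimal
```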

10 of 69

Policy evaluation

  • Find value function for a given policy
  • Converges to the unique true value function in the limit
  • In practice, use iterative policy evaluation (below) - stop when max delta below a threshold
  • Can update value function “in place” or use two copies
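A minimal tabular sketch of iterative policy evaluation, assuming arrays P[s, a, s'] (transition probabilities), R[s, a] (expected rewards), and pi[s, a] (action probabilities); the shapes and names are illustrative:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Evaluate a fixed policy pi on a tabular MDP via repeated Bellman backups."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman expectation backup for every state (done on a copy for clarity)
        V_new = np.einsum("sa,sa->s", pi, R + gamma * P @ V)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the max delta is below threshold
            return V_new
        V = V_new
```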

11 of 69

Bellman Backup & Contraction Mapping Theorem

  • The value function v_pi is the unique solution to its Bellman equation.
  • The Bellman backup operator is a gamma-contraction (in the max norm)
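For reference, the gamma-contraction property in the max norm (a standard statement, not reproduced on the slide):

```latex
% gamma-contraction of the Bellman backup operator T^pi in the max norm:
\lVert T^{\pi} v - T^{\pi} u \rVert_{\infty} \;\le\; \gamma \, \lVert v - u \rVert_{\infty}
\quad \text{for all } v, u,
% so by the Banach fixed-point theorem repeated backups converge to the unique fixed point v_pi.
```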

12 of 69

Policy improvement

  • Given the value function for the current policy, do a one-step look-ahead and check whether it is better to switch the policy to a new action in each state
  • Strictly improving, except when the policy is already optimal
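A minimal sketch of the greedy one-step look-ahead, using the same illustrative P[s, a, s'] and R[s, a] arrays as the policy-evaluation sketch above:

```python
import numpy as np

def policy_improvement(P, R, V, gamma=0.9):
    """Greedy one-step look-ahead with respect to the current value estimate V."""
    Q = R + gamma * P @ V                                    # action-values from the look-ahead
    pi_new = np.zeros_like(R)
    pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0    # deterministic greedy policy
    return pi_new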

13 of 69

Policy iteration

14 of 69

Value iteration

  • Combine policy evaluation and policy improvement in each state sweep
  • Convergence can be improved by running multiple steps of policy evaluation before policy improvement
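A minimal tabular sketch of value iteration under the same illustrative array shapes (P[s, a, s'] transition probabilities, R[s, a] expected rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration: a greedy (max) backup merges evaluation and improvement."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V               # one-step look-ahead for every (s, a)
        V_new = Q.max(axis=1)               # implicit policy improvement
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # greedy policy and its value estimate
        V = V_new
```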

15 of 69

Monte Carlo (prediction)

  • We do not assume complete knowledge of the environment.
  • Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Average returns observed after visits to a state
  • Does not depend on estimates of other states (no bootstrapping)
  • To get rid of exploring starts (we still need to ensure every (state, action) pair is sampled), use:
    • on-policy exploration (e.g. epsilon-greedy), or
    • off-policy learning (e.g. importance sampling)
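A minimal sketch of first-visit Monte Carlo prediction; the episode format (a list of (state, reward) pairs, with each reward following its state) is an illustrative assumption:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99):
    """First-visit MC: average the returns observed after first visits to each state."""
    returns = defaultdict(list)
    V = {}
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Work backwards so G accumulates the return from each step onward;
        # earlier occurrences overwrite later ones, leaving the first-visit return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            first_visit_return[s] = G
        for s, G_first in first_visit_return.items():
            returns[s].append(G_first)
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```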

16 of 69

Temporal Difference

  • We do not assume complete knowledge of the environment.
  • TD methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Bootstrap from estimates of other states
  • Monte Carlo methods need to wait until the end of the episode, while TD(0) methods only need to wait one step
  • On-policy (e.g. SARSA), off-policy (e.g. Q-learning, Expected SARSA); see the update sketches below
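Minimal sketches of the corresponding one-step updates, assuming V is an array of state values and Q a table of action values (names and step sizes are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0) prediction: bootstrap from the current estimate of the next state."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy control: the target uses the action actually taken next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy control: the target maximizes over next actions."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```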

17 of 69

Monte Carlo vs Temporal Difference

18 of 69

Policy-Gradient Methods

  • Value-based methods: learn a value function (an optimal value function leads to an optimal policy)
    • Goal: minimize the loss between the predicted and target value
    • Policy is implicit as it is generated directly from the value function (e.g. eps-greedy from Q-function)
    • Examples: Monte Carlo, DQN, SARSA
  • Policy-based methods: learn to approximate optimal policy directly (without learning a value function)
    • Parameterize the policy, e.g. using a neural network
    • Policy outputs a probability distribution over actions (stochastic policy)
    • Goal: maximize the performance of the parameterized policy using gradient ascent


19 of 69

REINFORCE: Algorithm

REINFORCE, or Monte Carlo policy-gradient, uses an estimated return from an entire episode to update the policy parameter θ.

In a loop,

  1. Use the policy πθ to collect episode τ
  2. Use the episode to estimate the gradient g = ∇θJ(θ)

  • Update the weights of the policy: θ ← θ + αg
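A minimal NumPy sketch of REINFORCE with a softmax policy on the 3-armed bandit used on the following slides; the learning rate, episode count, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([10.0, 5.0, 2.5])        # bandit from the example on the next slides
theta = np.zeros(3)                       # softmax policy preferences
alpha = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = rng.normal(means[a], 1.0)         # return of this one-step episode
    grad_log_pi = -probs                  # d/dtheta_j log pi(a) = -pi(j) for j != a ...
    grad_log_pi[a] += 1.0                 # ... and 1 - pi(a) for the chosen action
    theta += alpha * r * grad_log_pi      # REINFORCE update: theta <- theta + alpha * G * grad log pi
```

Because every reward here is positive, each sampled arm's preference is pushed up, which is exactly the issue the next slides work through.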


20-28 of 69

REINFORCE: Problem

  • Let's suppose we have a 3-armed bandit environment where the mean rewards for the arms are 10, 5, and 2.5 (with normally distributed noise with mean 0 and variance 1).
  • What would REINFORCE do? All the rewards are positive, so every sampled action has its probability pushed up.
  • How do we improve this? Subtract the mean: -(10 + 5 + 2.5)/3 ≈ -5.83
  • How do we generalize this? Subtract a baseline: -b(s_{t})

29 of 69

REINFORCE - Baseline: Algorithm

  • What did we make?


30-31 of 69

REINFORCE -> TD Algorithms

  • Let's say I don't want to wait until the end of each rollout to update states
    1. Rollouts are very long and too computationally intensive to store all in memory
    2. We are in a continuing task where we don't even have episode termination

Bootstrapping!

32-40 of 69

REINFORCE -> TD Algorithms

  • REINFORCE with baseline, high-level pseudocode:
    • Sample trajectory T, get total reward R
    • For each state-action pair (S, A) in T:
      • Let G' be the sum of the rewards from state S onwards
      • Update baseline B_S with G' (update mean reward, fit function)
      • Advantage Adv = (G' - B_S)
      • Update policy pi via Adv * grad log pi(A|S)
  • TD with baseline, high-level pseudocode:
    • Start at state S = S_0
    • While True, resetting S when necessary:
      • From state S, take action A, get reward R, go to state S'
      • Let G' = R + gamma V(S')
      • Update baseline B_S = V(S) with G' (fit value function V)
      • Advantage Adv = (G' - B_S)
      • Update policy pi via Adv * grad log pi(A|S)

Question: What is our new baseline for state S (i.e. the expected return of S)? It is the learned value function V(S). A tabular sketch of the TD-with-baseline loop follows.
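As a concrete illustration, here is a minimal NumPy sketch of the TD-with-baseline loop above (one-step actor-critic) on a tabular problem. The env interface (reset() -> s, step(a) -> (s', r, done)), the hyperparameters, and the function name are all assumptions:

```python
import numpy as np

def one_step_actor_critic(env, n_states, n_actions, alpha_v=0.1, alpha_pi=0.01,
                          gamma=0.99, steps=10_000, rng=np.random.default_rng(0)):
    """Tabular sketch of TD with a learned baseline (one-step actor-critic)."""
    theta = np.zeros((n_states, n_actions))   # softmax policy parameters
    V = np.zeros(n_states)                    # baseline / critic
    s = env.reset()
    for _ in range(steps):
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]   # G' = R + gamma V(S')
        adv = target - V[s]                             # Adv = G' - B_S
        V[s] += alpha_v * adv                           # fit the value function
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_pi * adv * grad_log_pi        # policy step: Adv * grad log pi(A|S)
        s = env.reset() if done else s_next
    return theta, V
```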

41 of 69

REINFORCE - Baseline: Problem

  • Let’s change the advantage function


A bit neater now:

42 of 69

Advantage Actor Critic (A2C): Differences reminder

  • What did we make this time?
  • Let’s distill the changes we made

REINFORCE → REINFORCE - Baseline: A(s, a) = G_t - b(s_t)

REINFORCE - Baseline → A2C: the baseline becomes the learned value function and the return is bootstrapped, so A(S, A) = R + gamma V(S') - V(S)

43 of 69

Actor Critic vs Advantage Actor Critic

  1. Actor Critic
    • We don't directly use the discounted cumulative rewards to calculate the policy update
  2. Advantage Actor Critic
    • We use the advantage function (or an approximation of it) to calculate the policy update

44 of 69

Key Fact About baseline methods

  • Baselines do not change the policy gradient in expectation, but they can reduce its variance!
    • Consider two states with values -100 and 100. Without a baseline, the gradient estimate (R + gamma V(S')) grad log pi(A|S) will have very high variance, hurting performance
    • A baseline is permissible iff it is the same (in expectation) for any action taken from a given state S in a given trajectory T (for instance, V(S) and the average G_t from S are safe, since they do not depend on the next action A taken from state S); a one-line check of the first claim follows below
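The first claim can be verified with the standard one-line argument (not shown on the original slide):

```latex
% A state-dependent baseline contributes zero to the policy gradient in expectation:
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1 = 0 .
```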

45 of 69

Policy-based methods, pros and cons

Pros

  • We can estimate the policy directly without storing additional data
  • Policy-gradient methods can learn a stochastic policy
    • We don’t need to implement an exploration/exploitation trade-off by hand
  • More effective in high-dimensional action spaces and continuous action spaces 🤔
  • Better convergence properties 🤔

Cons

  • Can converge to a local rather than a global maximum
  • Trains step by step, so it can take longer to train (inefficient)
  • The gradient estimate is very noisy: the collected trajectory may not be representative of the policy

46 of 69

Monte Carlo Tree Search

[Figure: the MCTS loop: from the root state, select actions down the tree until termination, then update the values back up the tree.]
47 of 69

Monte Carlo Tree Search: Benefits

When to use MCTS over learning algorithms?

  • You have access to an internal model (simulator)
  • In the limit, it can fully explore the environment
  • Achieves very good Q-values for states it explores thoroughly
  • The state-action space is large or dynamic (a different compute budget allows for more or less exploration)
  • Useful instead of learning algorithms when you only want the Q-value of a single starting state, or a single good trajectory to take

48 of 69

Monte Carlo Tree Search: Cons

  • Takes a long time to run
  • Requires storage of a tree, which can be computationally expensive

49 of 69

Deep Q-Network

Sampling: we perform actions and store the observed experience tuples in a replay memory

Training: select a small batch of tuples randomly and learn from this batch using a gradient descent update step

Note we sample from Q_w, but get training labels y_i from Q_target
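A minimal sketch of one such training step, written in PyTorch as an illustration; the network objects q_net and q_target, the batch format, and the hyperparameters are assumptions rather than the course's reference implementation:

```python
import torch
import torch.nn.functional as F

def dqn_training_step(q_net, q_target, optimizer, batch, gamma=0.99):
    """One gradient step on a replay minibatch; labels y come from the frozen q_target."""
    s, a, r, s_next, done = batch                    # tensors sampled from replay memory
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # prediction Q_w(s, a)
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values  # target label y_i
    loss = F.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```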

50 of 69

Deep Q-Network

Because deep Q-learning combines a non-linear Q-value function (Neural network) with bootstrapping (when we update targets with existing estimates and not an actual complete return), it might suffer from instability.

To help us stabilize the training, we implement three different solutions:

  1. Experience Replay (discussed in next slide).
  2. Fixed Q-Target to stabilize the training.
  3. Double Deep Q-Learning, to handle the problem of the overestimation of Q-values (the online network chooses the next action, but the target network supplies its Q-value); a sketch follows below.
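A minimal sketch of the Double DQN target under the same assumptions as above (PyTorch, with q_net as the online network and q_target as the target network; all names and tensors are illustrative):

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, done, gamma=0.99):
    """Double DQN target: the online network picks the action, the target network scores it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)       # action chosen by online net
        q_eval = q_target(s_next).gather(1, a_star).squeeze(1)   # evaluated by target net
        return r + gamma * (1 - done) * q_eval
```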

51 of 69

Deep Q-Learning: Experience Replay

Uses training experience more efficiently (a replay buffer saves experience samples that we can reuse during training)

  • Agent can learn from the same experience multiple times!

Avoid forgetting previous experiences and reduce the correlation between experiences

  • The problem with feeding sequential samples of experience to our neural network is that it tends to forget previous experiences as it gets new ones

By randomly sampling experiences, we remove correlation in the observation sequences and avoid action values oscillating or diverging catastrophically. A minimal replay buffer sketch follows.
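A minimal sketch of such a replay memory (the capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay memory: store transitions, sample uncorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))              # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```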

52 of 69

Deep Q-Learning: Fixed Q-Target

  • Problem: at every step of training, both our Q-values and target values shift (nonstationary targets)
    • Where the most instability comes from
    • Updating the network weights changes the target value, which requires more updates
    • Unintended generalization to other states S’ can lead to error propagation
  • Solution: use a separate network with fixed parameters to estimate the TD target, and copy the parameters from our Deep Q-Network to it every c steps
    • For c steps the target network is fixed; after that, you update the target network once and continue to update your value function for another c steps, then repeat the process
    • The network has more time to fit the targets accurately before they change
    • Slows down training, but there are not many alternatives (recently: functional regularization)

[Equation figure: the DQN loss compares the target value y_i (computed from Q_target) against the prediction Q_w(s, a).]

53 of 69

Deep Q-Learning: Fixed Q-Target

54 of 69

Big Picture Table

Method               | On/Off Policy? | Bootstraps?
---------------------|----------------|------------
Monte Carlo Methods  | On*            | N
SARSA                | On*            | Y
Expected SARSA       | Either         | Y
Q-Learning           | Off            | Y
REINFORCE            | On*            | N
Actor Critic, A2C    | On*            | Y

* can be made off-policy with importance sampling

55 of 69

Practice Questions

56 of 69

Question 1) Bandits

Why does the expected reward curve for UCB look so noisy, especially compared to epsilon-greedy or Boltzmann?

57 of 69

Solution 1) Bandits

At every timestep, the UCB policy is more likely to rapidly switch between actions in order to explore, since an action's upper confidence bound drops as soon as it is chosen (its count N_t(a) increases).

In contrast, Boltzmann uses a stochastic policy with a smooth distribution, so its expected reward changes smoothly as the action distribution changes.

Epsilon-greedy's greedy action only changes when the estimated mean reward of one arm overtakes that of another, which happens much more rarely than UCB's switches.

58 of 69

Question 2) Scaling Rewards

Let's say that we have two 3-armed bandits:

  • Mean rewards (-1, 0, 1) and noisy Gaussian reward with variance 1
  • Mean rewards (-10, 0, 10) and noisy Gaussian reward with variance 100

For the same random seed, does epsilon-greedy take the same sequence of actions on either task? How about UCB?

59 of 69

Solution 2) Scaling Rewards

Epsilon-greedy is scale-invariant: only which Q-value is the largest determines the policy, so it takes the same sequence of actions on both tasks. UCB, however, may give different results: the exploration (upper-confidence) bonus is the same in both cases, but the Q-values are not, so their relative influence changes.

In fact, we would expect UCB to switch actions more frequently in the first case, where the reward differences are small relative to the exploration bonus.

60 of 69

Question 3) Markov Decision Processes

61 of 69

Solution 3) Markov Decision Processes

62 of 69

Question 4) Comparing SARSA and Q-Learning

Given some trajectory:

We can define an update target for Q-values at step 0:

Using V(s), apply bootstrapping for 2-step returns.

Do the same, but in terms of Q(s, a).

Do the same, assuming the policy takes optimal actions with respect to its Q values.
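The trajectory and targets on this slide are images in the original; under one common indexing convention (reward r_{t+1} follows the pair (s_t, a_t)), the 2-step targets take the standard forms:

```latex
% Assuming the trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots :
y_0^{V}    = r_1 + \gamma r_2 + \gamma^2 V(s_2)
y_0^{Q}    = r_1 + \gamma r_2 + \gamma^2 Q(s_2, a_2)          % SARSA-style: uses the action actually taken
y_0^{\max} = r_1 + \gamma r_2 + \gamma^2 \max_a Q(s_2, a)     % Q-learning-style: assumes greedy actions
```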

63 of 69

Question 5) Value and Policy Iteration

At each iteration of the policy iteration algorithm, the expected return of the current policy is guaranteed to improve or remain the same.

64 of 69

Answer 5) Value and Policy Iteration

True, this is the Policy Improvement Theorem.

65 of 69

Question 6) SARSA

Was SARSA designed to learn Q-values using samples from a replay buffer of transitions collected from old policies? What about expected SARSA?

66 of 69

Answer 6) SARSA

SARSA:

Expected SARSA:

SARSA is on-policy: A’ is supposed to be drawn from the policy. (It can be extended to use a replay buffer, but this isn’t in the original formulation). In contrast, expected SARSA is designed to use (s, a, s’, r) from any source to perform updates.
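For reference, the standard tabular updates that the slide shows as images are:

```latex
\text{SARSA:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\big[R + \gamma\, Q(S',A') - Q(S,A)\big]
\text{Expected SARSA:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\Big[R + \gamma \sum_{a} \pi(a \mid S')\, Q(S',a) - Q(S,A)\Big]
```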

67 of 69

Question 7) MCTS & DQN

Which of the following statements about MCTS and DQNs is incorrect:

A) MCTS uses a tabular representation of action-values, whereas DQNs use function approximation.

B) Agents using DQNs are faster at choosing actions compared to those using MCTS.

C) MCTS gives the optimal Q-value and the optimal action for the root state it is exploring.

D) DQNs and MCTS are both on-policy.

E) MCTS has been shown to outperform comparable DQNs on some tasks.

68 of 69

Answer 7) MCTS & DQN

C) MCTS gives the optimal Q-value and the optimal action for the root state it is exploring.

D) DQNs and MCTS are both on-policy.

69 of 69

Questions?