1 of 69

10-703 Recitation 5

2 of 69

Topics (Lectures 1-8):

  1. Introduction to Reinforcement and Representation Learning
  2. Multi-Arm Bandits
  3. MDPs, Value and Policy Iteration
  4. Monte Carlo Learning, Temporal Difference Learning, Monte Carlo Tree Search
  5. Function Approximation, Deep Q learning
  6. Policy gradients, REINFORCE, Actor-Critic methods

***Note: this is not an exhaustive list. Anything covered in lecture is fair game.

3 of 69

What to expect?

  • Open note
    • Can only use downloaded content
    • No internet use
  • Types of questions:
    • True/False (with and without explanation)
    • Select all that apply (if asked to explain, explain for each choice why you did or did not select it)
    • Short answer
    • **If there is a box to explain, always explain your answer. Otherwise you will not get full credit.

4 of 69

Bandits

  • You have one state with k actions
  • Each action gets you a reward
  • Want to maximize total reward in the least amount of time
    • How to sample actions to do this efficiently?

Greedy action:

5 of 69

Epsilon-greedy bandits
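The action-selection and update rules for these bandit slides appear as images in the original. As a stand-in, here is a minimal NumPy sketch of greedy and epsilon-greedy selection with incremental sample-average Q updates (function names, epsilon, and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_action(Q):
    """Greedy action: pick the arm with the highest current value estimate."""
    return int(np.argmax(Q))

def epsilon_greedy_action(Q, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return greedy_action(Q)

def update_estimate(Q, N, a, reward):
    """Incremental sample-average update: Q[a] <- Q[a] + (r - Q[a]) / N[a]."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```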

6 of 69

Upper confidence bound

  • The square-root term is a measure of the uncertainty or variance in the estimate of a's value
  • As N_t(a) increases, the uncertainty term decreases
  • On the other hand, each time an action other than a is selected, t increases but N_t(a) does not, causing the uncertainty to increase
  • The use of the natural logarithm means that the increases get smaller over time, but are unbounded
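A minimal sketch of UCB action selection consistent with the bullets above, assuming Q and N are arrays of value estimates and visit counts and c is an exploration constant (all names illustrative):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a Q[a] + c * sqrt(ln t / N[a]); try untried actions first."""
    if np.any(N == 0):
        return int(np.argmin(N))            # every action sampled at least once
    bonus = c * np.sqrt(np.log(t) / N)      # uncertainty term: shrinks as N[a] grows
    return int(np.argmax(Q + bonus))
```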

7 of 69

Optimistic Initial Values

  • Set initial Q-values much higher than any reward you expect to see
  • Encourages some exploration initially
  • Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.
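A tiny illustrative sketch of the idea (the number of arms and the initial value of 5.0 are arbitrary):

```python
import numpy as np

k = 10                        # number of arms (illustrative)
Q = np.full(k, 5.0)           # optimistic: well above any realistic reward in this toy setup
N = np.zeros(k)
# Acting purely greedily w.r.t. Q still explores at first: every arm's first real
# reward is "disappointing" compared to 5.0, so its estimate drops and a different
# arm becomes greedy, until all arms have been tried several times.
```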

8 of 69

Bandits

  • Boltzmann Exploration
    • Uses a Boltzmann distribution to pick actions, very similar to a softmax function
    • By using a probability distribution that is updated, it balances exploration and exploitation

Action selection for Boltzmann
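The selection rule itself is an image in the original slide; here is a minimal sketch of Boltzmann (softmax-with-temperature) action selection, where tau is an assumed temperature parameter:

```python
import numpy as np

def boltzmann_action(Q, tau=1.0, rng=np.random.default_rng(0)):
    """Sample an action with probability proportional to exp(Q[a] / tau).

    Small tau -> nearly greedy (exploitation); large tau -> nearly uniform (exploration).
    """
    prefs = (Q - Q.max()) / tau             # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(Q), p=probs))
```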

9 of 69

MDPs

State-value function:

Action-value function:

Policy:

* denotes optimal
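The definitions on this slide are images in the original; for reference, the standard forms in Sutton & Barto notation are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}                          % return
v_\pi(s)   = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]            % state-value function
q_\pi(s,a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]   % action-value function
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)                             % (stochastic) policy
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)     % * denotes optimal
```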

10 of 69

Policy evaluation

  • Find value function for a given policy
  • Converges to the unique true value function in the limit
  • In practice, use iterative policy evaluation (below) - stop when max delta below a threshold
  • Can update value function “in place” or use two copies
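A minimal tabular sketch of iterative policy evaluation, assuming arrays P[s, a, s'] (transition probabilities), R[s, a] (expected rewards), and pi[s, a] (action probabilities); the shapes and names are illustrative:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Evaluate a fixed policy pi on a tabular MDP via repeated Bellman backups."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman expectation backup for every state (done on a copy for clarity)
        V_new = np.einsum("sa,sa->s", pi, R + gamma * P @ V)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the max delta is below threshold
            return V_new
        V = V_new
```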

11 of 69

Bellman Backup & Contraction Mapping Theorem

  • The value function v_pi is the unique solution to its Bellman equation.
  • The Bellman backup operator is a gamma-contraction (in the max norm)
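For reference, the gamma-contraction property in the max norm (a standard statement, not reproduced on the slide):

```latex
% gamma-contraction of the Bellman backup operator T^pi in the max norm:
\lVert T^{\pi} v - T^{\pi} u \rVert_{\infty} \;\le\; \gamma \, \lVert v - u \rVert_{\infty}
\quad \text{for all } v, u,
% so by the Banach fixed-point theorem repeated backups converge to the unique fixed point v_pi.
```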

12 of 69

Policy improvement

  • Given the value function for the current policy, do a one-step look-ahead and check whether it is better to switch the policy to a new action in each state
  • Strictly improving, except when the policy is already optimal
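A minimal sketch of the greedy one-step look-ahead, using the same illustrative P[s, a, s'] and R[s, a] arrays as the policy-evaluation sketch above:

```python
import numpy as np

def policy_improvement(P, R, V, gamma=0.9):
    """Greedy one-step look-ahead with respect to the current value estimate V."""
    Q = R + gamma * P @ V                                    # action-values from the look-ahead
    pi_new = np.zeros_like(R)
    pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0    # deterministic greedy policy
    return pi_new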

13 of 69

Policy iteration

14 of 69

Value iteration

  • Combine policy evaluation and policy improvement in each state sweep
  • Convergence can be improved by running multiple steps of policy evaluation before policy improvement
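A minimal tabular sketch of value iteration under the same illustrative array shapes (P[s, a, s'] transition probabilities, R[s, a] expected rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration: a greedy (max) backup merges evaluation and improvement."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V               # one-step look-ahead for every (s, a)
        V_new = Q.max(axis=1)               # implicit policy improvement
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # greedy policy and its value estimate
        V = V_new
```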

15 of 69

Monte Carlo (prediction)

  • We do not assume complete knowledge of the environment.
  • Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Average returns observed after visits to a state
  • Does not depend on estimates of other states (no bootstrapping)
  • To get rid of exploring starts (we still need to ensure every (state, action) pair is sampled), use:
    • on-policy exploration (e.g. epsilon-greedy), or
    • off-policy learning (e.g. importance sampling)
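A minimal sketch of first-visit Monte Carlo prediction; the episode format (a list of (state, reward) pairs, with each reward following its state) is an illustrative assumption:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99):
    """First-visit MC: average the returns observed after first visits to each state."""
    returns = defaultdict(list)
    V = {}
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Work backwards so G accumulates the return from each step onward;
        # earlier occurrences overwrite later ones, leaving the first-visit return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            first_visit_return[s] = G
        for s, G_first in first_visit_return.items():
            returns[s].append(G_first)
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```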

16 of 69

Temporal Difference

  • We do not assume complete knowledge of the environment.
  • TD methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
  • Bootstrap from estimates of other states
  • Monte Carlo methods need to wait until the end of the episode, while TD(0) methods only need to wait one step
  • On-policy (e.g. SARSA), off-policy (e.g. Q-learning, Expected SARSA); see the update sketches below
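Minimal sketches of the corresponding one-step updates, assuming V is an array of state values and Q a table of action values (names and step sizes are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0) prediction: bootstrap from the current estimate of the next state."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy control: the target uses the action actually taken next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy control: the target maximizes over next actions."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```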

17 of 69

Monte Carlo vs Temporal Difference

18 of 69

Policy-Gradient Methods

  • Value-based methods: learn a value function (an optimal value function leads to an optimal policy)
    • Goal: minimize the loss between the predicted and target value
    • Policy is implicit as it is generated directly from the value function (e.g. eps-greedy from Q-function)
    • Examples: Monte Carlo, DQN, SARSA
  • Policy-based methods: learn to approximate optimal policy directly (without learning a value function)
    • Parameterize the policy, e.g. using a neural network
    • Policy outputs a probability distribution over actions (stochastic policy)
    • Goal: maximize the performance of the parameterized policy using gradient ascent


19 of 69

REINFORCE: Algorithm

REINFORCE, or Monte Carlo policy-gradient, uses an estimated return from an entire episode to update the policy parameter θ.

In a loop,

  1. Use the policy πθ to collect episode τ
  2. Use the episode to estimate the gradient g = ∇θJ(θ)

  • Update the weights of the policy: θ ← θ + αg
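A minimal NumPy sketch of REINFORCE with a softmax policy on the 3-armed bandit used on the following slides; the learning rate, episode count, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([10.0, 5.0, 2.5])        # bandit from the example on the next slides
theta = np.zeros(3)                       # softmax policy preferences
alpha = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = rng.normal(means[a], 1.0)         # return of this one-step episode
    grad_log_pi = -probs                  # d/dtheta_j log pi(a) = -pi(j) for j != a ...
    grad_log_pi[a] += 1.0                 # ... and 1 - pi(a) for the chosen action
    theta += alpha * r * grad_log_pi      # REINFORCE update: theta <- theta + alpha * G * grad log pi
```

Because every reward here is positive, each sampled arm's preference is pushed up, which is exactly the issue the next slides work through.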


20-28 of 69

REINFORCE: Problem

  • Let's suppose we have a 3-armed bandit environment where the mean rewards for the arms are 10, 5, and 2.5 (with normally distributed noise with mean 0 and variance 1).
  • What would REINFORCE do? All the rewards are positive, so every sampled action has its probability pushed up.
  • How do we improve this? Subtract the mean: -(10 + 5 + 2.5)/3 ≈ -5.83
  • How do we generalize this? Subtract a baseline: -b(s_{t})

29 of 69

REINFORCE - Baseline: Algorithm

  • What did we make?


30-31 of 69

REINFORCE -> TD Algorithms

  • Let's say I don't want to wait until the end of each rollout to update states
    1. Rollouts are very long and too computationally intensive to store all in memory
    2. We are in a continuing task where we don't even have episode termination

Bootstrapping!

32-40 of 69

REINFORCE -> TD Algorithms

  • REINFORCE with baseline, high-level pseudocode:
    • Sample trajectory T, get total reward R
    • For each state-action pair (S, A) in T:
      • Let G' be the sum of the rewards from state S onwards
      • Update baseline B_S with G' (update mean reward, fit function)
      • Advantage Adv = (G' - B_S)
      • Update policy pi via Adv * grad log pi(A|S)
  • TD with baseline, high-level pseudocode:
    • Start at state S = S_0
    • While True, resetting S when necessary:
      • From state S, take action A, get reward R, go to state S'
      • Let G' = R + gamma V(S')
      • Update baseline B_S = V(S) with G' (fit value function V)
      • Advantage Adv = (G' - B_S)
      • Update policy pi via Adv * grad log pi(A|S)

Question: What is our new baseline for state S (i.e. the expected return of S)? It is the learned value function V(S). A tabular sketch of the TD-with-baseline loop follows.
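As a concrete illustration, here is a minimal NumPy sketch of the TD-with-baseline loop above (one-step actor-critic) on a tabular problem. The env interface (reset() -> s, step(a) -> (s', r, done)), the hyperparameters, and the function name are all assumptions:

```python
import numpy as np

def one_step_actor_critic(env, n_states, n_actions, alpha_v=0.1, alpha_pi=0.01,
                          gamma=0.99, steps=10_000, rng=np.random.default_rng(0)):
    """Tabular sketch of TD with a learned baseline (one-step actor-critic)."""
    theta = np.zeros((n_states, n_actions))   # softmax policy parameters
    V = np.zeros(n_states)                    # baseline / critic
    s = env.reset()
    for _ in range(steps):
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]   # G' = R + gamma V(S')
        adv = target - V[s]                             # Adv = G' - B_S
        V[s] += alpha_v * adv                           # fit the value function
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_pi * adv * grad_log_pi        # policy step: Adv * grad log pi(A|S)
        s = env.reset() if done else s_next
    return theta, V
```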

41 of 69

REINFORCE - Baseline: Problem

  • Let’s change the advantage function


A bit neater now:

42 of 69

Advantage Actor Critic (A2C): Differences reminder

  • What did we make this time?
  • Let’s distill the changes we made

REINFORCE → REINFORCE - Baseline: A(s, a) = G_t - b(s_t)

REINFORCE - Baseline → A2C: the baseline becomes the learned value function and the return is bootstrapped, so A(S, A) = R + gamma V(S') - V(S)

43 of 69

Actor Critic vs Advantage Actor Critic

  1. Actor Critic
    • We don't directly use the discounted cumulative rewards to calculate the policy update
  2. Advantage Actor Critic
    • We use the advantage function (or an approximation of it) to calculate the policy update

44 of 69

Key Fact About baseline methods

  • Baselines do not change the policy gradient in expectation, but they can reduce its variance!
    • Consider two states with values -100 and 100. Without a baseline, the gradient estimate (R + gamma V(S')) grad log pi(A|S) will have very high variance, hurting performance
    • A baseline is permissible iff it is the same (in expectation) for any action taken from a given state S in a given trajectory T (for instance, V(S) and the average G_t from S are safe, since they do not depend on the next action A taken from state S); a one-line check of the first claim follows below
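The first claim can be verified with the standard one-line argument (not shown on the original slide):

```latex
% A state-dependent baseline contributes zero to the policy gradient in expectation:
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1 = 0 .
```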

45 of 69

Policy-based methods, pros and cons

Pros

  • We can estimate the policy directly without storing additional data
  • Policy-gradient methods can learn a stochastic policy
    • We don’t need to implement an exploration/exploitation trade-off by hand
  • More effective in high-dimensional action spaces and continuous action spaces 🤔
  • Better convergence properties 🤔

Cons

  • Can converge to a local rather than a global maximum
  • Trains step by step, so it can take longer to train (inefficient)
  • The gradient estimate is very noisy: the collected trajectory may not be representative of the policy

46 of 69

Monte Carlo Tree Search

[Figure: the MCTS loop: from the root state, select actions down the tree until termination, then update the values back up the tree.]
47 of 69

Monte Carlo Tree Search: Benefits

When to use MCTS over learning algorithms?

  • You have access to an internal model (simulator)
  • In the limit, it can fully explore the environment
  • Achieves very good Q-values for states it explores thoroughly
  • The state-action space is large or dynamic (a different compute budget allows for more or less exploration)
  • Useful instead of learning algorithms when you only want the Q-value of a single starting state, or a single good trajectory to take

48 of 69

Monte Carlo Tree Search: Cons

  • Takes a long time to run
  • Requires storage of a tree, which can be computationally expensive

49 of 69

Deep Q-Network

Sampling: we perform actions and store the observed experience tuples in a replay memory

Training: select a small batch of tuples randomly and learn from this batch using a gradient descent update step

Note we sample from Q_w, but get training labels y_i from Q_target
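A minimal sketch of one such training step, written in PyTorch as an illustration; the network objects q_net and q_target, the batch format, and the hyperparameters are assumptions rather than the course's reference implementation:

```python
import torch
import torch.nn.functional as F

def dqn_training_step(q_net, q_target, optimizer, batch, gamma=0.99):
    """One gradient step on a replay minibatch; labels y come from the frozen q_target."""
    s, a, r, s_next, done = batch                    # tensors sampled from replay memory
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # prediction Q_w(s, a)
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values  # target label y_i
    loss = F.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```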

50 of 69

Deep Q-Network

Because deep Q-learning combines a non-linear Q-value function (Neural network) with bootstrapping (when we update targets with existing estimates and not an actual complete return), it might suffer from instability.

To help us stabilize the training, we implement three different solutions:

  1. Experience Replay (discussed in next slide).
  2. Fixed Q-Target to stabilize the training.
  3. Double Deep Q-Learning, to handle the problem of the overestimation of Q-values (the online network chooses the next action, but the target network supplies its Q-value); a sketch follows below.
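A minimal sketch of the Double DQN target under the same assumptions as above (PyTorch, with q_net as the online network and q_target as the target network; all names and tensors are illustrative):

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, done, gamma=0.99):
    """Double DQN target: the online network picks the action, the target network scores it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)       # action chosen by online net
        q_eval = q_target(s_next).gather(1, a_star).squeeze(1)   # evaluated by target net
        return r + gamma * (1 - done) * q_eval
```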

51 of 69

Deep Q-Learning: Experience Replay

Uses training experience more efficiently (a replay buffer saves experience samples that we can reuse during training)

  • Agent can learn from the same experience multiple times!

Avoid forgetting previous experiences and reduce the correlation between experiences

  • The problem with feeding sequential samples of experience to our neural network is that it tends to forget previous experiences as it gets new ones

By randomly sampling experiences, we remove correlation in the observation sequences and avoid action values oscillating or diverging catastrophically. A minimal replay buffer sketch follows.
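A minimal sketch of such a replay memory (the capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay memory: store transitions, sample uncorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))              # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```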

52 of 69

Deep Q-Learning: Fixed Q-Target

  • Problem: at every step of training, both our Q-values and target values shift (nonstationary targets)
    • Where the most instability comes from
    • Updating the network weights changes the target value, which requires more updates
    • Unintended generalization to other states S’ can lead to error propagation
  • Solution: use a separate network with fixed parameters to estimate the TD target, and copy the parameters from our Deep Q-Network to it every c steps
    • For c steps the target network is fixed; after that, you update the target network once and continue to update your value function for another c steps, then repeat the process
    • The network has more time to fit the targets accurately before they change
    • Slows down training, but there are not many alternatives (recently: functional regularization)

[Equation figure: the DQN loss compares the target value y_i (computed from Q_target) against the prediction Q_w(s, a).]

53 of 69

Deep Q-Learning: Fixed Q-Target

54 of 69

Big Picture Table

Method               | On/Off Policy? | Bootstraps?
---------------------|----------------|------------
Monte Carlo Methods  | On*            | N
SARSA                | On*            | Y
Expected SARSA       | Either         | Y
Q-Learning           | Off            | Y
REINFORCE            | On*            | N
Actor Critic, A2C    | On*            | Y

* can be made off-policy with importance sampling

55 of 69

Practice Questions

56 of 69

Question 1) Bandits

Why does the expected reward curve for UCB look so noisy, especially compared to epsilon-greedy or Boltzmann?

57 of 69

Solution 1) Bandits

At every timestep, the UCB policy is more likely to rapidly switch between actions in order to explore, since an action's upper confidence bound drops as soon as it is chosen (its count N_t(a) increases).

In contrast, Boltzmann uses a stochastic policy with a smooth distribution, so its expected reward changes smoothly as the action distribution changes.

Epsilon-greedy's greedy action only changes when the estimated mean reward of one arm overtakes that of another, which happens much more rarely than UCB's switches.

58 of 69

Question 2) Scaling Rewards

Let's say that we have two 3-armed bandits:

  • Mean rewards (-1, 0, 1) and noisy Gaussian reward with variance 1
  • Mean rewards (-10, 0, 10) and noisy Gaussian reward with variance 100

For the same random seed, does epsilon-greedy take the same sequence of actions on either task? How about UCB?

59 of 69

Solution 2) Scaling Rewards

Epsilon-greedy is scale-invariant: only which Q-value is the largest determines the policy, so it takes the same sequence of actions on both tasks. UCB, however, may give different results: the exploration (upper-confidence) bonus is the same in both cases, but the Q-values are not, so their relative influence changes.

In fact, we would expect UCB to switch actions more frequently in the first case, where the reward differences are small relative to the exploration bonus.

60 of 69

Question 3) Markov Decision Processes

61 of 69

Solution 3) Markov Decision Processes

62 of 69

Question 4) Comparing SARSA and Q-Learning

Given some trajectory:

We can define an update target for Q-values at step 0:

Using V(s), apply bootstrapping for 2-step returns.

Do the same, but in terms of Q(s, a).

Do the same, assuming the policy takes optimal actions with respect to its Q values.
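The trajectory and targets on this slide are images in the original; under one common indexing convention (reward r_{t+1} follows the pair (s_t, a_t)), the 2-step targets take the standard forms:

```latex
% Assuming the trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots :
y_0^{V}    = r_1 + \gamma r_2 + \gamma^2 V(s_2)
y_0^{Q}    = r_1 + \gamma r_2 + \gamma^2 Q(s_2, a_2)          % SARSA-style: uses the action actually taken
y_0^{\max} = r_1 + \gamma r_2 + \gamma^2 \max_a Q(s_2, a)     % Q-learning-style: assumes greedy actions
```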

63 of 69

Question 5) Value and Policy Iteration

At each iteration of the policy iteration algorithm, the expected return of the current policy is guaranteed to improve or remain the same.

64 of 69

Answer 5) Value and Policy Iteration

True, this is the Policy Improvement Theorem.

65 of 69

Question 6) SARSA

Was SARSA designed to learn Q-values using samples from a replay buffer of transitions collected from old policies? What about expected SARSA?

66 of 69

Answer 6) SARSA

SARSA:

Expected SARSA:

SARSA is on-policy: A’ is supposed to be drawn from the policy. (It can be extended to use a replay buffer, but this isn’t in the original formulation). In contrast, expected SARSA is designed to use (s, a, s’, r) from any source to perform updates.
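For reference, the standard tabular updates that the slide shows as images are:

```latex
\text{SARSA:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\big[R + \gamma\, Q(S',A') - Q(S,A)\big]
\text{Expected SARSA:}\quad Q(S,A) \leftarrow Q(S,A) + \alpha\Big[R + \gamma \sum_{a} \pi(a \mid S')\, Q(S',a) - Q(S,A)\Big]
```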

67 of 69

Question 7) MCTS & DQN

Which of the following statements about MCTS and DQNs is incorrect:

A) MCTS uses a tabular representation of action-values, whereas DQNs use function approximation.

B) Agents using DQNs are faster at choosing actions compared to those using MCTS.

C) MCTS gives the optimal Q-value and the optimal action for the root state it is exploring.

D) DQNs and MCTS are both on-policy.

E) MCTS has been shown to outperform comparable DQNs on some tasks.

68 of 69

Answer 7) MCTS & DQN

C) MCTS gives the optimal Q-value and the optimal action for the root state it is exploring.

D) DQNs and MCTS are both on-policy.

69 of 69

Questions?