10-703 Recitation 5
Topics (Lectures 1-8):
***Note: this is not an exhaustive list. Anything covered in lecture is fair game.
What to expect?
Bandits
Greedy action:
Epsilon-greedy bandits
Upper confidence bound
Optimistic Initial Values
Action selection for Boltzmann
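A minimal sketch of these three action-selection rules (epsilon-greedy, UCB, Boltzmann/softmax); the names Q, N, eps, c, and tau are illustrative placeholders, not from the slides:

import numpy as np

def epsilon_greedy(Q, eps, rng):
    # With probability eps explore uniformly; otherwise exploit the current argmax.
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def ucb(Q, N, t, c):
    # Add an exploration bonus that shrinks as an arm's pull count N[a] grows.
    return int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / (N + 1e-8))))

def boltzmann(Q, tau, rng):
    # Sample an arm from a softmax over Q-values; tau is the temperature.
    logits = np.asarray(Q) / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(Q), p=probs))

Optimistic initial values correspond to initializing Q with a large constant instead of zeros, so early exploitation still visits every arm.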
MDPs
State-value function:
Action-value function:
Policy:
* denotes optimal
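For reference, the standard definitions these headers refer to (policy π, discount factor γ):

V^{\pi}(s) = E_{\pi}[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} | s_{t} = s ]
Q^{\pi}(s, a) = E_{\pi}[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} | s_{t} = s, a_{t} = a ]
V^{*}(s) = \max_{\pi} V^{\pi}(s),   Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)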
Policy evaluation
Bellman Backup & Contraction Mapping Theorem
Policy improvement
Policy iteration
Value iteration
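A minimal value-iteration sketch; the transition-model layout P[s][a] = [(prob, next_state, reward), ...] is an assumption for illustration, not the recitation's code:

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    # Repeatedly apply the Bellman optimality backup until the values stop changing.
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new  # convergence follows from the contraction mapping theorem
        V = V_new

Policy iteration differs only in that it alternates full policy evaluation with a greedy policy-improvement step.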
Monte Carlo (prediction)
Temporal Difference
Monte Carlo vs Temporal Difference
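The contrast in one update each (α is a step size): Monte Carlo waits for the full return G_{t}, while TD(0) bootstraps from the current estimate of the next state's value.

MC:     V(s_{t}) \leftarrow V(s_{t}) + \alpha [ G_{t} - V(s_{t}) ]
TD(0):  V(s_{t}) \leftarrow V(s_{t}) + \alpha [ r_{t+1} + \gamma V(s_{t+1}) - V(s_{t}) ]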
Policy-Gradient Methods
REINFORCE: Algorithm
REINFORCE, or Monte Carlo policy-gradient, uses an estimated return from an entire episode to update the policy parameter θ.
In a loop: generate an episode with the current policy, compute the return G_{t} at each step, and update θ in the direction of G_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} | s_{t}).
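A minimal sketch of one such update, assuming a Gymnasium-style env and a PyTorch policy network that outputs action logits (env, policy, and optimizer are placeholders, not the recitation's code):

import torch

def reinforce_update(env, policy, optimizer, gamma=0.99):
    # 1) Roll out one full episode with the current policy.
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # 2) Compute the return G_t at every step (backwards accumulation).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # 3) Gradient ascent on sum_t G_t * log pi(a_t | s_t).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()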
REINFORCE: Problem
But all the rewards are positive...
How do we improve this?
Subtract the mean! -(10 + 5 + 2.5)/3 = -5.83
How do we generalize this?
Subtract a baseline! -b(s_{t})
REINFORCE - Baseline: Algorithm
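The resulting update (a sketch of the standard form): each log-probability is weighted by G_{t} - b(s_{t}) instead of G_{t}, and the baseline itself, typically a learned estimate of V(s_{t}) with parameters w, is regressed toward the observed returns.

\theta \leftarrow \theta + \alpha (G_{t} - b_{w}(s_{t})) \nabla_{\theta} \log \pi_{\theta}(a_{t} | s_{t})
w \leftarrow w - \beta \nabla_{w} (G_{t} - b_{w}(s_{t}))^{2}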
REINFORCE -> TD Algorithms
Bootstrapping!
Question: What is our new baseline for state S? (i.e., the expected return of S)
REINFORCE - Baseline: Problem
A bit neater now:
Advantage Actor Critic (A2C): Differences reminder
A(s,a) = G_{t} - b(s_{t})
REINFORCE
REINFORCE - Baseline
A2C
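The progression these labels summarize: REINFORCE weights \nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t}) by G_{t}, REINFORCE-Baseline weights it by G_{t} - b(s_{t}), and A2C replaces the Monte Carlo return with a bootstrapped (TD) estimate of the advantage:

A(s_{t}, a_{t}) \approx r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})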
Actor Critic vs Advantage Actor Critic
Key Fact About Baseline Methods
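The key fact usually meant here: subtracting any state-dependent baseline leaves the policy gradient unbiased, since for each state s

E_{a \sim \pi_{\theta}}[ b(s) \nabla_{\theta} \log \pi_{\theta}(a|s) ] = b(s) \nabla_{\theta} \sum_{a} \pi_{\theta}(a|s) = b(s) \nabla_{\theta} 1 = 0,

so the baseline only reduces variance.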
Policy-based methods, pros and cons
Pros
Cons
Monte Carlo Tree Search
(Diagram: from the root state, select actions and simulate until termination, then update the stored values along the visited path.)
Monte Carlo Tree Search: Benefits
When to use MCTS over learning algorithms?
Monte Carlo Tree Search: Cons
Deep Q-Network
Sampling: we perform actions and store the observed experience tuples in a replay memory
Training: select a small batch of tuples randomly and learn from this batch using a gradient descent update step
Note that we sample (act) using Q_w, but get the training labels y_i from Q_target.
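A minimal sketch of that gradient step, assuming PyTorch networks q_w and q_target and a batch of (s, a, r, s_next, done) tensors (all names are placeholders, not the recitation's code):

import torch
import torch.nn.functional as F

def dqn_update(q_w, q_target, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    # Labels y_i come from the frozen target network Q_target (no gradient flows through them).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    # Predictions come from Q_w, the network we are optimizing.
    q_sa = q_w(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()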
Deep Q-Network
Because deep Q-learning combines a non-linear Q-value function (a neural network) with bootstrapping (updating targets with existing estimates rather than an actual complete return), it can suffer from instability.
To help us stabilize the training, we implement three different solutions:
Deep Q-Learning: Experience Replay
Uses training experience more efficiently (a replay buffer saves experience samples that we can reuse during training)
Avoids forgetting previous experiences and reduces the correlation between experiences
By randomly sampling experiences, we remove correlation in the observation sequences and prevent the action values from oscillating or diverging catastrophically.
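A minimal replay buffer sketch (the capacity is an arbitrary illustrative default):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)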
Deep Q-Learning: Fixed Q-Target
(Loss diagram: the target value y_i is computed with Q_target, while the prediction comes from Q_w.)
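The "fixed" part is simply that Q_target's weights are frozen between periodic copies of Q_w; a sketch assuming PyTorch modules named as in the update above:

import torch

def sync_target(q_w, q_target, tau=None):
    # tau=None: hard copy every K steps; otherwise a soft (Polyak) update with rate tau.
    if tau is None:
        q_target.load_state_dict(q_w.state_dict())
    else:
        with torch.no_grad():
            for p_t, p in zip(q_target.parameters(), q_w.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)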
Big Picture Table
Method | On/Off Policy? | Bootstraps? |
Monte Carlo Methods | On* | N |
SARSA | On* | Y |
Expected SARSA | Either | Y |
Q-Learning | Off | Y |
REINFORCE | On* | N |
Actor Critic, A2C | On* | Y |
* can be made off-policy with importance sampling
Practice Questions
Question 1) Bandits
Why does the expected reward curve for UCB look so noisy, especially compared to epsilon-greedy or Boltzmann?
Solution 1) Bandits
At every timestep, the UCB policy is more likely to switch rapidly between actions in order to explore, since actions have their upper confidence bounds reduced when they are chosen.
In contrast, Boltzmann uses a stochastic policy with a smooth distribution, so its expected reward changes smoothly as the action distribution changes.
Epsilon-greedy's behavior only changes when the mean reward of one arm passes the mean reward of another, which happens much more rarely than UCB's switching.
Question 2) Scaling Rewards
Let's say that we have two 3-armed bandits:
For the same random seed, does epsilon-greedy take the same sequence of actions on either task? How about UCB?
Solution 2) Scaling Rewards
Epsilon-greedy is scale-invariant: only which arm has the maximum Q-value determines the policy, so it takes the same sequence of actions on both. UCB, however, may give different results: the confidence (exploration bonus) terms are the same in both cases, but the Q-values are not, so the balance between exploration and exploitation changes.
In fact, we would expect UCB to switch actions more frequently in the top case.
Question 3) Markov Decision Processes
Solution 3) Markov Decision Processes
Question 4) Comparing SARSA and Q-Learning
Given some trajectory:
We can define an update target for Q-values at step 0:
Using V(s), apply bootstrapping for 2-step returns.
Do the same, but in terms of Q(s, a).
Do the same, assuming the policy takes optimal actions with respect to its Q values.
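One way to write the three targets (a sketch, assuming a trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, ... and discount γ):

y_0 = r_{1} + \gamma r_{2} + \gamma^{2} V(s_{2})                (bootstrapping with V)
y_0 = r_{1} + \gamma r_{2} + \gamma^{2} Q(s_{2}, a_{2})         (SARSA-style, using the action actually taken)
y_0 = r_{1} + \gamma r_{2} + \gamma^{2} \max_{a} Q(s_{2}, a)    (Q-learning-style, greedy with respect to Q)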
Question 5) Comparing SARSA and Q-Learning
At each iteration of the policy iteration algorithm, the expected return of the current policy is guaranteed to improve or remain the same.
Answer 5) Value and Policy Iteration
True, this is the Policy Improvement Theorem.
Question 6) SARSA
Was SARSA designed to learn Q-values using samples from a replay buffer of transitions collected from old policies? What about expected SARSA?
Answer 6) SARSA
SARSA:
Expected SARSA:
SARSA is on-policy: A' is supposed to be drawn from the current policy. (It can be extended to use a replay buffer, but this isn't in the original formulation.) In contrast, expected SARSA is designed to use (s, a, s', r) tuples from any source to perform updates.
Question 7) MCTS & DQN
Which of the following statements about MCTS and DQNs is incorrect:
A) MCTS uses a tabular representation of action-values, whereas DQNs use function approximation.
B) Agents using DQNs are faster at choosing actions compared to those using MCTS.
C) MCTS gives the optimal Q-value for the root state it is exploring and the optimal action a.
D) DQNs and MCTS are both on-policy.
E) MCTS has been shown to outperform comparable DQNs on some tasks.
Answer 7) MCTS & DQN
C) MCTS gives the optimal Q-value for the root state it is exploring and the optimal action a.
D) DQNs and MCTS are both on-policy.
Questions?