Many Faces of Reinforcement Learning
Reinforcement learning sits at the intersection of many fields, each with its own name for the same underlying problem:
Computer Science: machine learning
Engineering: optimal control
Mathematics: operations research
Economics: bounded rationality
Neuroscience: reward system
Psychology: classical/operant conditioning
Branches of Machine Learning
Machine learning has three main branches: supervised learning, unsupervised learning, and reinforcement learning.
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential)
Agent’s actions affect the subsequent data it receives
Examples of Reinforcement Learning
Fly stunt manoeuvres in a helicopter
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
Rewards
A reward Rt is a scalar feedback signal
Indicates how well the agent is doing at step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
−ve reward for crashing
Defeat the world champion at Backgammon
+/−ve reward for winning/losing a game
Manage an investment portfolio
+ve reward for each $ in bank
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
Play many different Atari games better than humans
+/−ve reward for increasing/decreasing score
Sequential Decision Making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)
Agent and Environment
[Figure: the agent–environment interaction loop — the agent receives observation Ot and reward Rt, and emits action At to the environment]
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1
t increments at env. step
History and State
The history is the sequence of observations, actions, rewards
Ht = O1, R1, A1, ..., At−1, Ot, Rt
i.e. all observable variables up to time t
i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
St = f (Ht )
Environment State
[Figure: the agent–environment loop, with the environment state S^e_t labelled inside the environment]
The environment state S^e_t is the environment's private representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment state is not usually visible to the agent
Even if S^e_t is visible, it may contain irrelevant information
Agent State
[Figure: the agent–environment loop, with the agent state S^a_t labelled inside the agent]
The agent state S^a_t is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be any function of history:
S^a_t = f(Ht)
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St ] = P[St+1 | S1, ..., St ]
“The future is independent of the past given the present”
H1:t → St → Ht+1:∞
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state S^e_t is Markov
The history Ht is Markov
Fully Observable Environments
[Figure: the agent–environment loop in which the agent directly observes the state St]
Full observability: the agent directly observes the environment state
Ot = S^a_t = S^e_t
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
A robot with camera vision isn't told its absolute location
A trading agent only observes current prices
A poker playing agent only observes public cards
Now agent state ≠ environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation S^a_t, e.g.
Complete history: S^a_t = Ht
Beliefs of environment state: S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n])
Recurrent neural network: S^a_t = σ(S^a_{t−1} Ws + Ot Wo)
Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent's representation of the environment
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
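As a small illustration (not from the lecture), here is a Python sketch of the two policy types for a made-up two-state, two-action problem; the state and action names are arbitrary.

```python
# A minimal sketch: deterministic vs. stochastic policies for a toy problem.
import numpy as np

# Deterministic policy: a = pi(s), here just a lookup table.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
stochastic_pi = {"s0": {"left": 0.8, "right": 0.2},
                 "s1": {"left": 0.5, "right": 0.5}}

def sample_action(pi, s, rng=np.random.default_rng(0)):
    """Sample an action from a stochastic policy pi(.|s)."""
    probs = pi[s]
    return rng.choice(list(probs.keys()), p=list(probs.values()))

print(deterministic_pi["s0"])              # -> "left"
print(sample_action(stochastic_pi, "s1"))  # random draw from pi(.|s1)
```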
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
Model
A model predicts what the environment will do next
P predicts the next state
R predicts the next (immediate) reward, e.g.
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
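As a rough sketch of this idea (my own illustration, with made-up numbers), a tabular model can be stored as a transition tensor P[s, a, s'] and a reward table R[s, a]:

```python
# A minimal sketch: a tabular model with P[s, a, s'] and R[s, a].
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical model: P[s, a, :] is a distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalise so each P[s, a, :] sums to 1
R = rng.random((n_states, n_actions))    # R[s, a] = E[R_{t+1} | S_t = s, A_t = a]

def model_step(s, a):
    """Sample a next state from P and return it with the expected reward."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

print(model_step(0, 1))
```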
Categorizing RL agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
RL Agent Taxonomy
[Figure: Venn diagram relating value function, policy and model — their overlaps give value-based, policy-based, actor-critic, model-free and model-based agents]
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Exploration and Exploitation (1)
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment Without losing too much reward along the way
Exploration and Exploitation (2)
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
Examples
Restaurant Selection
Exploitation: go to your favourite restaurant
Exploration: try a new restaurant
Online Banner Advertisements
Exploitation: show the most successful advert
Exploration: show a different advert
Oil Drilling
Exploitation: drill at the best known location
Exploration: drill at a new location
Game Playing
Exploitation: play the move you believe is best
Exploration: play an experimental move
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning
Where the environment is fully observable
i.e. The current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
P [St+1 | St ] = P [St+1 | S1, ..., St ]
The state captures all relevant information from the history Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is defined by
Pss' = P[St+1 = s' | St = s]
The state transition matrix P defines transition probabilities from all states s (rows) to all successor states s' (columns),
P = [ P11 ... P1n
       ⋮        ⋮
      Pn1 ... Pnn ]
where each row of the matrix sums to 1.
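A minimal Python sketch (with a hypothetical 3-state chain) of what this looks like in code; the asserted row sums correspond to the condition above:

```python
# A minimal sketch: a state transition matrix P, rows indexed by s, columns by s'.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a distribution over s'

def simulate(P, s0, n_steps, rng=np.random.default_rng(0)):
    """Sample a trajectory S_1, S_2, ... from the Markov chain."""
    s, path = s0, [s0]
    for _ in range(n_steps):
        s = rng.choice(len(P), p=P[s])
        path.append(s)
    return path

print(simulate(P, s0=0, n_steps=10))
```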
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple (S, P)
S is a (finite) set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
Markov Reward Process
A Markov reward process is a Markov chain with values.
Definition
A Markov Reward Process is a tuple (S, P, R, γ)
S is a finite set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
R is a reward function, Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
Return
Definition
The return Gt is the total discounted reward from time-step t.
Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^∞ γ^k Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γ^k R
This values immediate reward above delayed reward
γ close to 0 leads to "myopic" evaluation
γ close to 1 leads to "far-sighted" evaluation
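As a small worked example (not from the lecture), the return can be accumulated backwards through a reward sequence:

```python
# A minimal sketch: computing G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite episode.
def discounted_return(rewards, gamma):
    """rewards[k] is R_{t+k+1}; returns G_t = sum_k gamma**k * rewards[k]."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: G = r + gamma * G
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, gamma=0.0))   # myopic: only R_{t+1} counts
```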
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
Value Function
The value function v (s) gives the long-term value of state s
Definition
The state value function v (s) of an MRP is the expected return starting from state s
v (s) = E [Gt | St = s]
Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward Rt+1
discounted value of successor state γv(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]
Bellman Equation for MRPs (2)
v (s) = E [Rt+1 + γv (St+1) | St = s]
[Backup diagram: state s and its successor states s']
v(s) = Rs + γ Σ_{s'∈S} Pss' v(s')
Solving the Bellman Equation
The Bellman equation is a linear equation
It can be solved directly:
v = R + γPv
(I − γP) v = R
v = (I − γP)−1 R
Computational complexity is O(n³) for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
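A short numpy sketch of the direct solution above, using a hypothetical 3-state MRP; the numbers are made up purely for illustration:

```python
# A minimal sketch: solving the Bellman equation directly, v = (I - gamma*P)^(-1) R.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])      # transition matrix (hypothetical)
R = np.array([-1.0, -2.0, 0.0])      # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # O(n^3) direct solution
print(v)

# Sanity check: v should satisfy v = R + gamma * P v
assert np.allclose(v, R + gamma * P @ v)
```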
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple (S, A, P, R, γ)
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix,
P^a_ss' = P[St+1 = s' | St = s, At = a]
R is a reward function, R^a_s = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
Policies (1)
Definition
A policy π is a distribution over actions given states,
π(a|s) = P [At = a | St = s]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. Policies are stationary (time-independent),
At ∼ π(·|St ), ∀ t > 0
Value Function
Definition
The state-value function vπ (s) of an MDP is the expected return starting from state s, and then following policy π
vπ (s) = Eπ [Gt | St = s]
Definition
The action-value function qπ (s, a) is the expected return
starting from state s, taking action a, and then following policy π
qπ (s, a) = Eπ [Gt | St = s, At = a]
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ (s, a) = Eπ [Rt+1 + γqπ (St+1, At+1) | St = s, At = a]
Bellman Expectation Equation for V^π
[Backup diagram: state s, with one branch per action a]
vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a)
Bellman Expectation Equation for Q^π
[Backup diagram: state–action pair (s, a), with one branch per successor state s']
qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s')
Bellman Expectation Equation for vπ (2)
[Backup diagram: state s, actions a, successor states s']
vπ(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s') )
Bellman Expectation Equation for qπ (2)
[Backup diagram: state–action pair (s, a), successor states s' and actions a']
qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' Σ_{a'∈A} π(a'|s') qπ(s', a')
Bellman Expectation Equation (Matrix Form)
The Bellman expectation equation can be expressed concisely using the induced MRP,
vπ = Rπ + γPπvπ
with direct solution
vπ = (I − γPπ )−1 Rπ
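As an illustration of the induced MRP (my own sketch, with a hypothetical 2-state, 2-action MDP and policy), Pπ and Rπ can be formed by averaging over the policy and then solved directly:

```python
# A minimal sketch: form (P_pi, R_pi) for a fixed policy and solve for v_pi.
import numpy as np

n_s, gamma = 2, 0.9
# Hypothetical tabular MDP: P[s, a, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],     # pi[s, a] = pi(a|s)
               [0.2, 0.8]])

# Induced MRP: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'];  R_pi[s] = sum_a pi(a|s) R[s, a]
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(v_pi)
```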
Optimal Value Function
Definition
The optimal state-value function v∗(s) is the maximum value function over all policies
v∗(s) = max_π vπ(s)
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies
q∗(s, a) = max_π qπ(s, a)
The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value fn.
Optimal Policy
Define a partial ordering over policies
π ≥ π' if vπ(s) ≥ vπ'(s), ∀s
Theorem
For any Markov Decision Process
There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, vπ∗ (s) = v∗(s)
All optimal policies achieve the optimal action-value function, qπ∗ (s, a) = q∗(s, a)
Finding an Optimal Policy
An optimal policy can be found by maximising over q∗(s, a),
π∗(a|s) = 1  if a = argmax_{a∈A} q∗(s, a)
          0  otherwise
There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy
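A tiny sketch (with a made-up q∗ table) of reading off the greedy, deterministic optimal policy:

```python
# A minimal sketch: acting greedily with respect to a given q*(s, a) table.
import numpy as np

q_star = np.array([[1.0, 3.0],    # hypothetical q*(s, a), 3 states x 2 actions
                   [2.5, 2.0],
                   [0.0, 0.0]])

pi_star = np.argmax(q_star, axis=1)    # pi*(s) = argmax_a q*(s, a)
print(pi_star)                         # -> [1 0 0]
```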
Bellman Optimality Equation for v∗
The optimal value functions are recursively related by the Bellman optimality equations:
[Backup diagram: state s, with one branch per action a]
v∗(s) = max_a q∗(s, a)
Bellman Optimality Equation for Q∗
[Backup diagram: state–action pair (s, a), with one branch per successor state s']
q∗(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' v∗(s')
Solving the Bellman Optimality Equation
Bellman Optimality Equation is non-linear
No closed form solution (in general)
Many iterative solution methods
Value Iteration
Policy Iteration
Q-learning
Sarsa
Lecture 3: Planning by Dynamic Programming
What is Dynamic Programming?
Dynamic: sequential or temporal component to the problem
Programming: optimising a "program", i.e. a policy
c.f. linear programming
A method for solving complex problems
By breaking them down into subproblems
Solve the subproblems
Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
Optimal substructure
Principle of optimality applies
Optimal solution can be decomposed into subproblems
Overlapping subproblems
Subproblems recur many times
Solutions can be cached and reused
Markov decision processes satisfy both properties
Bellman equation gives recursive decomposition
Value function stores and reuses solutions
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP
For prediction:
Input: MDP (S, A, P, R, γ) and policy π
or: MRP (S, Pπ, Rπ, γ)
Output: value function vπ
Or for control:
Input: MDP (S, A, P, R, γ)
Output: optimal value function v∗
and: optimal policy π∗
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
Scheduling algorithms
String algorithms (e.g. sequence alignment)
Graph algorithms (e.g. shortest path algorithms)
Graphical models (e.g. Viterbi algorithm)
Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
Problem: evaluate a given policy π
Solution: iterative application of Bellman expectation backup
v1 → v2 → ... → vπ
Using synchronous backups,
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s'), where s' is a successor state of s
We will discuss asynchronous backups later
Convergence to vπ will be proven at the end of the lecture
Iterative Policy Evaluation (2)
[Backup diagram: vk+1(s) at state s is computed from vk(s') at successor states s']
vk+1(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )
vk+1 = Rπ + γ Pπ vk
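The backup above translates almost line-for-line into code. The following is my own tabular sketch (not the course's implementation), assuming the MDP is given as arrays P[s, a, s'] and R[s, a]:

```python
# A minimal sketch: synchronous iterative policy evaluation, v_{k+1} = R_pi + gamma * P_pi v_k.
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """P[s,a,s'], R[s,a], pi[s,a]; returns v_pi for the given policy."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # induced MRP transition matrix
    R_pi = np.einsum('sa,sa->s', pi, R)     # induced MRP reward vector
    v = np.zeros(P.shape[0])
    while True:
        v_new = R_pi + gamma * P_pi @ v     # one synchronous Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Example with a hypothetical 2-state, 2-action MDP and a uniform random policy:
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(iterative_policy_evaluation(P, R, pi, gamma=0.9))
```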
Evaluating a Random Policy in the Small Gridworld
Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice as shaded squares)
Actions leading out of the grid leave state unchanged
Reward is −1 until the terminal state is reached
Agent follows uniform random policy
π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
Iterative Policy Evaluation in Small Gridworld
[Figure: vk for the random policy (left) and the greedy policy w.r.t. vk (right); the policy arrows are not reproduced here]

k = 0:
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0

k = 1:
 0.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 |  0.0

k = 2:
 0.0 | -1.7 | -2.0 | -2.0
-1.7 | -2.0 | -2.0 | -2.0
-2.0 | -2.0 | -2.0 | -1.7
-2.0 | -2.0 | -1.7 |  0.0
Iterative Policy Evaluation in Small Gridworld (2)
k = 3:
 0.0 | -2.4 | -2.9 | -3.0
-2.4 | -2.9 | -3.0 | -2.9
-2.9 | -3.0 | -2.9 | -2.4
-3.0 | -2.9 | -2.4 |  0.0

k = 10:
 0.0 | -6.1 | -8.4 | -9.0
-6.1 | -7.7 | -8.4 | -8.4
-8.4 | -8.4 | -7.7 | -6.1
-9.0 | -8.4 | -6.1 |  0.0

k = ∞:
 0.0 | -14. | -20. | -22.
-14. | -18. | -20. | -20.
-20. | -20. | -18. | -14.
-22. | -20. | -14. |  0.0

[From k = 3 onwards the greedy policy w.r.t. vk is the optimal policy]
How to Improve a Policy
Given a policy π
Evaluate the policy π
vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
Improve the policy by acting greedily with respect to vπ
π' = greedy(vπ)
In Small Gridworld the improved policy was optimal, π' = π∗
In general, more iterations of improvement / evaluation are needed
But this process of policy iteration always converges to π∗
Policy Iteration
Policy evaluation: estimate vπ
Iterative policy evaluation
Policy improvement: generate π' ≥ π
Greedy policy improvement
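A compact sketch of the loop described above (my own tabular version, assuming exact policy evaluation by matrix inversion and arrays P[s, a, s'], R[s, a]):

```python
# A minimal sketch: policy iteration = exact evaluation + greedy improvement.
import numpy as np

def policy_iteration(P, R, gamma):
    """P[s,a,s'], R[s,a]; returns (optimal deterministic policy, v*)."""
    n_s, n_a = R.shape
    policy = np.zeros(n_s, dtype=int)            # start from an arbitrary policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly.
        P_pi = P[np.arange(n_s), policy]         # P_pi[s, s'] = P[s, pi(s), s']
        R_pi = R[np.arange(n_s), policy]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. q_pi(s, a).
        q = R + gamma * np.einsum('sat,t->sa', P, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):   # policy stable => optimal
            return policy, v
        policy = new_policy
```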
Policy Improvement
Consider a deterministic policy, a = π(s)
We can improve the policy by acting greedily
π'(s) = argmax_{a∈A} qπ(s, a)
This improves the value from any state s over one step,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)
It therefore improves the value function, vπ'(s) ≥ vπ(s)
vπ(s) ≤ qπ(s, π'(s)) = Eπ'[Rt+1 + γvπ(St+1) | St = s]
      ≤ Eπ'[Rt+1 + γqπ(St+1, π'(St+1)) | St = s]
      ≤ Eπ'[Rt+1 + γRt+2 + γ²qπ(St+2, π'(St+2)) | St = s]
      ≤ Eπ'[Rt+1 + γRt+2 + ... | St = s] = vπ'(s)
Policy Improvement (2)
If improvements stop,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
Then the Bellman optimality equation has been satisfied
vπ(s) = max_{a∈A} qπ(s, a)
Therefore vπ (s) = v∗(s) for all s ∈ S
so π is an optimal policy
Modified Policy Iteration
Does policy evaluation need to converge to vπ ?
Or should we introduce a stopping condition
e.g. ε-convergence of value function
Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
Why not update policy every iteration? i.e. stop after k = 1
This is equivalent to value iteration (next section)
Generalised Policy Iteration
Policy evaluation: estimate vπ
Any policy evaluation algorithm
Policy improvement: generate π' ≥ π
Any policy improvement algorithm
Principle of Optimality
Any optimal policy can be subdivided into two components:
An optimal first action A∗
Followed by an optimal policy from successor state S'
Theorem (Principle of Optimality)
A policy π(a|s) achieves the optimal value from state s, vπ(s) = v∗(s), if and only if
For any state s' reachable from s
π achieves the optimal value from state s', vπ(s') = v∗(s')
Value Iteration
Problem: find optimal policy π
Solution: iterative application of Bellman optimality backup
v1 → v2 → ... → v∗
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk (s')
Convergence to v∗ will be proven later
Unlike policy iteration, there is no explicit policy
Intermediate value functions may not correspond to any policy
Value Iteration (2)
[Backup diagram: vk+1(s) at state s is computed from vk(s') at successor states s']
vk+1(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )
vk+1 = max_{a∈A} ( R^a + γ P^a vk )
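The Bellman optimality backup above, as a short tabular sketch (my own code, assuming arrays P[s, a, s'] and R[s, a]):

```python
# A minimal sketch: value iteration, v_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') v_k(s')].
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P[s,a,s'], R[s,a]; returns (v*, greedy policy w.r.t. v*)."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum('sat,t->sa', P, v)   # one-step lookahead
        v_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```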
Asynchronous Dynamic Programming
DP methods described so far used synchronous backups
i.e. all states are backed up in parallel
Asynchronous DP backs up states individually, in any order
For each selected state, apply the appropriate backup
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected
Some Technical Questions
How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
And therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem
Lecture 4: Model-Free Prediction
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs
All episodes must terminate
Monte-Carlo Policy Evaluation
Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., Sk ∼ π
Recall that the return is the total discounted reward:
Gt = Rt+1 + γRt+2 + ... + γT−1RT
Recall that the value function is the expected return:
vπ (s) = Eπ [Gt | St = s]
Monte-Carlo policy evaluation uses empirical mean return instead of expected return
First-Visit Monte-Carlo Policy Evaluation
To evaluate state s
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
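A minimal sketch of first-visit Monte-Carlo evaluation (my own implementation, assuming episodes are given as (state, reward) pairs, where the reward is the one received on leaving that state):

```python
# A minimal sketch: first-visit Monte-Carlo policy evaluation from a batch of episodes.
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """episodes: list of [(state, reward), ...]; returns the estimated V(s)."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    for episode in episodes:
        # Compute the return G_t following each time-step, working backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit to this state only
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}

# Hypothetical two-episode batch:
episodes = [[("s1", 0), ("s2", 0)], [("s2", 1)]]
print(first_visit_mc(episodes, gamma=1.0))   # {'s1': 0.0, 's2': 0.5}
```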
Every-Visit Monte-Carlo Policy Evaluation
To evaluate state s
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
Again, V(s) → vπ(s) as N(s) → ∞
Incremental Mean
The mean µ1, µ2, ... of a sequence x1, x2, ... can be computed incrementally,
µk = (1/k) Σ_{j=1}^k xj
   = (1/k) (xk + (k − 1) µk−1)
   = µk−1 + (1/k) (xk − µk−1)
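A quick numerical check (my own example) that the incremental update reproduces the batch mean:

```python
# A minimal sketch: incremental mean update mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k.
import numpy as np

xs = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k        # incremental mean update
assert np.isclose(mu, xs.mean())
print(mu)                      # 2.8
```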
Incremental Monte-Carlo Updates
Update V (s) incrementally after episode S1, A1, R2, ..., ST
For each state St with return Gt
N(St) ← N(St) + 1
V(St) ← V(St) + (1/N(St)) (Gt − V(St))
In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes.
V (St ) ← V (St ) + α (Gt − V (St ))
Temporal-Difference Learning
TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
MC and TD
Goal: learn vπ online from experience under policy π
Incremental every-visit Monte-Carlo
Update value V (St ) toward actual return Gt
V (St ) ← V (St ) + α (Gt − V (St ))
Simplest temporal-difference learning algorithm: TD(0)
Update value V (St ) toward estimated return Rt+1 + γV (St+1)
V (St ) ← V (St ) + α (Rt+1 + γV (St+1) − V (St ))
Rt+1 + γV (St+1) is called the TD target
δt = Rt+1 + γV (St+1) − V (St ) is called the TD error
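A minimal sketch of the TD(0) update applied to a stream of (s, r, s') transitions (my own code; the states and rewards are made up):

```python
# A minimal sketch: one TD(0) step, V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)).
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) backup in place and return the TD error delta_t."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = defaultdict(float)
# Hypothetical transitions (s, r, s'); the value of a terminal state stays 0.
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "terminal")]:
    td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0)
print(dict(V))
```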
Driving Home Example
State              | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office     |  0                     | 30                   | 30
reach car, raining |  5                     | 35                   | 40
exit highway       | 20                     | 15                   | 35
behind truck       | 30                     | 10                   | 40
home street        | 40                     |  3                   | 43
arrive home        | 43                     |  0                   | 43
Driving Home Example: MC vs. TD
[Figure: changes recommended by Monte-Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
TD can learn online after every step
MC must wait until end of episode before return is known
TD can learn without the final outcome
TD can learn from incomplete sequences
MC can only learn from complete sequences
TD works in continuing (non-terminating) environments
MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
Return Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT is an unbiased estimate of vπ(St)
True TD target Rt+1 + γvπ(St+1) is an unbiased estimate of vπ(St)
TD target Rt+1 + γV(St+1) is a biased estimate of vπ(St)
TD target is much lower variance than the return:
Return depends on many random actions, transitions, rewards
TD target depends on one random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
Batch MC and TD
MC and TD converge: V (s) → vπ (s) as experience → ∞
But what about batch solution for finite experience?
s^1_1, a^1_1, r^1_2, ..., s^1_{T1}
⋮
s^K_1, a^K_1, r^K_2, ..., s^K_{TK}
e.g. Repeatedly sample episode k ∈ [1, K]
Apply MC or TD(0) to episode k
AB Example
Two states A, B; no discounting; 8 episodes of experience
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V (A), V (B)?
Advantages and Disadvantages of MC vs. TD (3)
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
Monte-Carlo Backup
[Backup diagram: MC backs up V(St) along the entire sampled trajectory until termination]
V(St) ← V(St) + α (Gt − V(St))
Temporal-Difference Backup
[Backup diagram: TD backs up V(St) from a single sampled step to St+1 with reward Rt+1]
V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
Dynamic Programming Backup
[Backup diagram: DP backs up V(St) from a full one-step lookahead over all successor states St+1]
V(St) ← Eπ[Rt+1 + γV(St+1)]
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample
TD samples
Unified View of Reinforcement Learning
n-Step Prediction
Let TD target look n steps into the future
n-Step Return
Consider the following n-step returns for n = 1, 2, ∞:
n = 1 (TD)   Gt^(1) = Rt+1 + γV(St+1)
n = 2        Gt^(2) = Rt+1 + γRt+2 + γ²V(St+2)
⋮
n = ∞ (MC)   Gt^(∞) = Rt+1 + γRt+2 + ... + γ^{T−1}RT
Define the n-step return
Gt^(n) = Rt+1 + γRt+2 + ... + γ^{n−1}Rt+n + γ^n V(St+n)
n-step temporal-difference learning
V(St) ← V(St) + α ( Gt^(n) − V(St) )
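A small sketch of computing Gt^(n) for a finite episode (my own code; it bootstraps from a stored value estimate, and the rewards/values are made up):

```python
# A minimal sketch: the n-step return G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).
def n_step_return(rewards, values, t, n, gamma):
    """rewards[k] is R_{k+1}; values[k] is the current estimate V(S_k).
    If the episode ends before step t+n, no bootstrap term is added."""
    T = len(rewards)
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]
        discount *= gamma
    if t + n < len(values):                 # bootstrap from V(S_{t+n})
        g += discount * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]                   # R_1, R_2, R_3 (hypothetical)
values = [0.5, 0.4, 0.3, 0.0]               # V(S_0..S_3), terminal value 0
print(n_step_return(rewards, values, t=0, n=1, gamma=0.9))   # TD(0) target
print(n_step_return(rewards, values, t=0, n=3, gamma=0.9))   # here equals the MC return
```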