1 of 99

2 of 99

Many Faces of Reinforcement Learning

[Diagram: reinforcement learning at the intersection of many fields: machine learning (computer science), reward systems (neuroscience), classical/operant conditioning (psychology), bounded rationality (economics), operations research (mathematics), and optimal control (engineering)]

3 of 99

Branches of Machine Learning

[Diagram: machine learning branches into supervised learning, unsupervised learning, and reinforcement learning]

4 of 99

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

There is no supervisor, only a reward signal

Feedback is delayed, not instantaneous

Time really matters (sequential)

Agent’s actions affect the subsequent data it receives

5 of 99

Examples of Reinforcement Learning

Fly stunt manoeuvres in a helicopter

Defeat the world champion at Backgammon

Manage an investment portfolio

Control a power station

Make a humanoid robot walk

Play many different Atari games better than humans

6 of 99

Rewards

A reward Rt is a scalar feedback signal

Indicates how well agent is doing at step t

The agent’s job is to maximise cumulative reward

Reinforcement learning is based on the reward hypothesis

Definition (Reward Hypothesis)

All goals can be described by the maximisation of expected cumulative reward

7 of 99

Examples of Rewards

Fly stunt manoeuvres in a helicopter

+ve reward for following desired trajectory

−ve reward for crashing

Defeat the world champion at Backgammon

+/−ve reward for winning/losing a game

Manage an investment portfolio

+ve reward for each $ in bank

Control a power station

+ve reward for producing power

−ve reward for exceeding safety thresholds

Make a humanoid robot walk

+ve reward for forward motion

−ve reward for falling over

Play many different Atari games better than humans

+/−ve reward for increasing/decreasing score

8 of 99

Sequential Decision Making

Goal: select actions to maximise total future reward

Actions may have long term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward

Examples:

A financial investment (may take months to mature)

Refuelling a helicopter (might prevent a crash in several hours)

Blocking opponent moves (might help winning chances many moves from now)

9 of 99

Agent and Environment

[Diagram: agent-environment interaction loop: the agent emits action At, the environment returns observation Ot and reward Rt]

10 of 99

Agent and Environment

[Diagram: agent-environment interaction loop: the agent emits action At, the environment returns observation Ot and reward Rt]

At each step t the agent:

Executes action At

Receives observation Ot

Receives scalar reward Rt

The environment:

Receives action At

Emits observation Ot+1

Emits scalar reward Rt+1

t increments at env. step

11 of 99

History and State

The history is the sequence of observations, actions, rewards

Ht = O1, R1, A1, ..., At−1, Ot, Rt

i.e. all observable variables up to time t

i.e. the sensorimotor stream of a robot or embodied agent

What happens next depends on the history:

The agent selects actions

The environment selects observations/rewards

State is the information used to determine what happens next

Formally, state is a function of the history:

St = f (Ht )

12 of 99

Environment State

[Diagram: agent-environment loop, with the environment state S^e_t inside the environment]

The environment state S^e_t is the environment’s private representation

i.e. whatever data the environment uses to pick the next observation/reward

The environment state is not usually visible to the agent

Even if S^e_t is visible, it may contain irrelevant information

13 of 99

Agent State

[Diagram: agent-environment loop, with the agent state S^a_t inside the agent]

The agent state S^a_t is the agent’s internal representation

i.e. whatever information the agent uses to pick the next action

i.e. it is the information used by reinforcement learning algorithms

It can be any function of history:

S^a_t = f(Ht)

14 of 99

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition

A state St is Markov if and only if

P[St+1 | St ] = P[St+1 | S1, ..., St ]

“The future is independent of the past given the present”

H1:t → St → Ht+1:∞

Once the state is known, the history may be thrown away

i.e. The state is a sufficient statistic of the future

The environment state S^e_t is Markov

The history Ht is Markov

15 of 99

Fully Observable Environments

[Diagram: agent-environment loop in which the agent directly observes the environment state St, with action At and reward Rt]

Full observability: agent directly observes environment state

Ot = S^a_t = S^e_t

Agent state = environment state = information state

Formally, this is a Markov decision process (MDP)

16 of 99

Partially Observable Environments

Partial observability: agent indirectly observes environment:

A robot with camera vision isn’t told its absolute location

A trading agent only observes current prices

A poker playing agent only observes public cards

Now agent state ≠ environment state

Formally this is a partially observable Markov decision process (POMDP)

Agent must construct its own state representation S^a_t, e.g.

Complete history: S^a_t = Ht

Beliefs of environment state: S^a_t = (P[S^e_t = s1], ..., P[S^e_t = sn])

Recurrent neural network: S^a_t = σ(S^a_{t−1} Ws + Ot Wo)
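
As a minimal sketch of the last construction, assuming NumPy row vectors for state and observation and hypothetical weight matrices W_s and W_o, the recurrent update could look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_agent_state(s_prev, o_t, W_s, W_o):
    """One recurrent update of the agent state: S^a_t = sigma(S^a_{t-1} W_s + O_t W_o).

    s_prev and o_t are row vectors; W_s and W_o are hypothetical learned weights.
    """
    return sigmoid(s_prev @ W_s + o_t @ W_o)
```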

17 of 99

Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent’s behaviour function

Value function: how good is each state and/or action

Model: agent’s representation of the environment

18 of 99

Policy

A policy is the agent’s behaviour

It is a map from state to action, e.g.

Deterministic policy: a = π(s)

Stochastic policy: π(a|s) = P[At = a|St = s]

19 of 99

Value Function

Value function is a prediction of future reward

Used to evaluate the goodness/badness of states

And therefore to select between actions, e.g.

vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]

20 of 99

Model

A model predicts what the environment will do next

P predicts the next state

R predicts the next (immediate) reward, e.g.

P^a_ss' = P[St+1 = s' | St = s, At = a]

R^a_s = E[Rt+1 | St = s, At = a]

21 of 99

Categorizing RL agents (1)

Value Based

No Policy (Implicit)

Value Function

Policy Based

Policy

No Value Function

Actor Critic

Policy

Value Function

22 of 99

Categorizing RL agents (2)

Model Free

Policy and/or Value Function

No Model

Model Based

Policy and/or Value Function

Model

23 of 99

RL Agent Taxonomy

[Diagram: RL agent taxonomy relating Policy, Value Function, and Model: value-based, policy-based, and actor-critic agents, each either model-free or model-based]

24 of 99

Learning and Planning

Two fundamental problems in sequential decision making

Reinforcement Learning:

The environment is initially unknown

The agent interacts with the environment

The agent improves its policy

Planning:

A model of the environment is known

The agent performs computations with its model (without any external interaction)

The agent improves its policy

a.k.a. deliberation, reasoning, introspection, pondering, thought, search

25 of 99

Exploration and Exploitation (1)

Reinforcement learning is like trial-and-error learning

The agent should discover a good policy

From its experiences of the environment

Without losing too much reward along the way

26 of 99

Exploration and Exploitation (2)

Exploration finds more information about the environment

Exploitation exploits known information to maximise reward

It is usually important to explore as well as exploit

27 of 99

Examples

Restaurant Selection

Exploitation: Go to your favourite restaurant

Exploration: Try a new restaurant

Online Banner Advertisements

Exploitation: Show the most successful advert

Exploration: Show a different advert

Oil Drilling

Exploitation: Drill at the best known location

Exploration: Drill at a new location

Game Playing

Exploitation: Play the move you believe is best

Exploration: Play an experimental move

28 of 99

Prediction and Control

Prediction: evaluate the future

Given a policy

Control: optimise the future

Find the best policy

29 of 99

Introduction to MDPs

Markov decision processes formally describe an environment for reinforcement learning

Where the environment is fully observable

i.e. The current state completely characterises the process

Almost all RL problems can be formalised as MDPs, e.g.

Optimal control primarily deals with continuous MDPs

Partially observable problems can be converted into MDPs

Bandits are MDPs with one state

30 of 99

Markov Property

“The future is independent of the past given the present”

Definition

A state St is Markov if and only if

P [St+1 | St ] = P [St+1 | S1, ..., St ]

The state captures all relevant information from the history

Once the state is known, the history may be thrown away

i.e. The state is a sufficient statistic of the future

31 of 99

State Transition Matrix

For a Markov state s and successor state s', the state transition probability is defined by

Pss' = P[St+1 = s' | St = s]

State transition matrix P defines transition probabilities from all states s to all successor states s',

          [ P11 ... P1n ]
P =       [  .        . ]        (row = from state, column = to state)
          [ Pn1 ... Pnn ]

where each row of the matrix sums to 1.
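
A minimal sketch of such a matrix, assuming NumPy and a hypothetical two-state chain, with a check that each row is a probability distribution:

```python
import numpy as np

# Hypothetical 2-state Markov chain; P[s, s2] = P(S_{t+1} = s2 | S_t = s)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1
```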

32 of 99

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.

Definition

A Markov Process (or Markov Chain) is a tuple (S, P)

S is a (finite) set of states

P is a state transition probability matrix,

Pss' = P[St+1 = s' | St = s]

33 of 99

Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition

A Markov Reward Process is a tuple (S, P, R, γ)

S is a finite set of states

P is a state transition probability matrix,

Pss' = P[St+1 = s' | St = s]

R is a reward function, Rs = E [Rt+1 | St = s]

γ is a discount factor, γ ∈ [0, 1]

34 of 99

Return

Definition

The return Gt is the total discounted reward from time-step t.

Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^∞ γ^k Rt+k+1

The discount γ ∈ [0, 1] is the present value of future rewards

The value of receiving reward R after k + 1 time-steps is γ^k R

This values immediate reward above delayed reward

γ close to 0 leads to “myopic” evaluation

γ close to 1 leads to “far-sighted” evaluation
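
A small sketch of the return computation (the reward sequence below is made up for illustration), showing how γ controls myopic vs. far-sighted evaluation:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 0, 10]                      # a single delayed reward
print(discounted_return(rewards, 0.1))       # myopic:      ~0.01
print(discounted_return(rewards, 0.99))      # far-sighted: ~9.70
```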

35 of 99

Why discount?

Most Markov reward and decision processes are discounted. Why?

Mathematically convenient to discount rewards

Avoids infinite returns in cyclic Markov processes

Uncertainty about the future may not be fully represented

If the reward is financial, immediate rewards may earn more interest than delayed rewards

Animal/human behaviour shows preference for immediate reward

It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

36 of 99

Value Function

The value function v (s) gives the long-term value of state s

Definition

The state value function v (s) of an MRP is the expected return starting from state s

v (s) = E [Gt | St = s]

37 of 99

Bellman Equation for MRPs

The value function can be decomposed into two parts:

immediate reward Rt+1

discounted value of successor state γv(St+1)

v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]

38 of 99

Bellman Equation for MRPs (2)

v (s) = E [Rt+1 + γv (St+1) | St = s]

[Backup diagram: v(s) at state s, looking ahead one step to successor states s' with values v(s')]

v(s) = Rs + γ Σ_{s'∈S} Pss' v(s')

39 of 99

Solving the Bellman Equation

The Bellman equation is a linear equation

It can be solved directly:

v = R + γPv
(I − γP) v = R
v = (I − γP)^(−1) R

Computational complexity is O(n³) for n states

Direct solution only possible for small MRPs

There are many iterative methods for large MRPs, e.g.

Dynamic programming

Monte-Carlo evaluation

Temporal-Difference learning
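
A minimal sketch of the direct solution, assuming NumPy and a small hypothetical 3-state MRP (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix, reward vector, discount
P = np.array([[0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])    # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9

# v = (I - gamma*P)^(-1) R, solved as a linear system rather than an explicit inverse
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```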

40 of 99

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition

A Markov Decision Process is a tuple (S, A, P, R, γ)

S is a finite set of states

A is a finite set of actions

P is a state transition probability matrix,
P^a_ss' = P[St+1 = s' | St = s, At = a]

R is a reward function, R^a_s = E[Rt+1 | St = s, At = a]

γ is a discount factor, γ ∈ [0, 1].

41 of 99

Policies (1)

Definition

A policy π is a distribution over actions given states,

π(a|s) = P [At = a | St = s]

A policy fully defines the behaviour of an agent

MDP policies depend on the current state (not the history)

i.e. Policies are stationary (time-independent),

At ∼ π(·|St ), ∀ t > 0

42 of 99

Value Function

Definition

The state-value function vπ (s) of an MDP is the expected return starting from state s, and then following policy π

vπ (s) = Eπ [Gt | St = s]

Definition

The action-value function qπ (s, a) is the expected return

starting from state s, taking action a, and then following policy π

qπ (s, a) = Eπ [Gt | St = s, At = a]

43 of 99

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]

The action-value function can similarly be decomposed,

qπ (s, a) = Eπ [Rt+1 + γqπ (St+1, At+1) | St = s, At = a]

44 of 99

Bellman Expectation Equation for vπ

[Backup diagram: v(s) at state s, averaging over actions a with values q(s, a)]

vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a)

45 of 99

Bellman Expectation Equation for qπ

[Backup diagram: q(s, a) at (s, a), averaging over successor states s' with values v(s')]

qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s')

46 of 99

Bellman Expectation Equation for vπ (2)

[Two-step backup diagram: v(s) at state s, averaging over actions a and then over successor states s' with values v(s')]

vπ(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s') )

47 of 99

Bellman Expectation Equation for qπ (2)

[Two-step backup diagram: q(s, a) looking ahead to successor states s' and then to next actions a' with values q(s', a')]

qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' Σ_{a'∈A} π(a'|s') qπ(s', a')

48 of 99

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP,

vπ = Rπ + γPπvπ

with direct solution

vπ = (I − γPπ)^(−1) Rπ

49 of 99

Optimal Value Function

Definition

The optimal state-value function v∗(s) is the maximum value function over all policies

v∗(s) = max_π vπ(s)

The optimal action-value function q∗(s, a) is the maximum action-value function over all policies

q∗(s, a) = max_π qπ(s, a)

The optimal value function specifies the best possible performance in the MDP.

An MDP is “solved” when we know the optimal value fn.

50 of 99

Optimal Policy

Define a partial ordering over policies

π ≥ π' if vπ(s) ≥ vπ'(s), ∀s

Theorem

For any Markov Decision Process

There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π

All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)

All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)

51 of 99

Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a),

π∗(a|s) = 1  if a = argmax_{a∈A} q∗(s, a)
          0  otherwise

There is always a deterministic optimal policy for any MDP

If we know q∗(s, a), we immediately have the optimal policy
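
A sketch of extracting that deterministic policy, assuming q∗ is already available as a NumPy array indexed by (state, action); the numbers are made up:

```python
import numpy as np

def greedy_policy(q_star):
    """pi*(s) = argmax_a q*(s, a) for a tabular q* of shape (n_states, n_actions)."""
    return np.argmax(q_star, axis=1)

q_star = np.array([[1.0, 2.5, 0.3],     # hypothetical optimal action values
                   [0.0, -1.0, 4.0]])
print(greedy_policy(q_star))            # -> [1 2]
```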

52 of 99

Bellman Optimality Equation for v∗

The optimal value functions are recursively related by the Bellman optimality equations:

[Backup diagram: v∗(s) at state s, taking a max over actions a with values q∗(s, a)]

v∗(s) = max_a q∗(s, a)

53 of 99

Bellman Optimality Equation for q∗

[Backup diagram: q∗(s, a) looking ahead to successor states s' with values v∗(s')]

q∗(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' v∗(s')

54 of 99

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear

No closed form solution (in general)

Many iterative solution methods

Value Iteration

Policy Iteration

Q-learning

Sarsa

55 of 99

Lecture 3: Planning by Dynamic Programming

What is Dynamic Programming?

Dynamic: sequential or temporal component to the problem

Programming: optimising a “program”, i.e. a policy

c.f. linear programming

A method for solving complex problems

By breaking them down into subproblems

Solve the subproblems

Combine solutions to subproblems

56 of 99

Lecture 3: Planning by Dynamic Programming

Requirements for Dynamic Programming

Dynamic Programming is a very general solution method for problems which have two properties:

Optimal substructure

Principle of optimality applies

Optimal solution can be decomposed into subproblems

Overlapping subproblems

Subproblems recur many times

Solutions can be cached and reused

Markov decision processes satisfy both properties

Bellman equation gives recursive decomposition

Value function stores and reuses solutions

57 of 99

Lecture 3: Planning by Dynamic Programming

Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP

It is used for planning in an MDP

For prediction:

Input: MDP (S, A, P, R, γ) and policy π

or: MRP (S, Pπ, Rπ, γ)

Output: value function vπ

Or for control:

Input: MDP (S, A, P, R, γ)

Output: optimal value function v∗

and: optimal policy π∗

58 of 99

Lecture 3: Planning by Dynamic Programming

Other Applications of Dynamic Programming

Dynamic programming is used to solve many other problems, e.g.

Scheduling algorithms

String algorithms (e.g. sequence alignment)

Graph algorithms (e.g. shortest path algorithms)

Graphical models (e.g. Viterbi algorithm)

Bioinformatics (e.g. lattice models)

59 of 99

Lecture 3: Planning by Dynamic Programming

Iterative Policy Evaluation

Problem: evaluate a given policy π

Solution: iterative application of Bellman expectation backup

v1 → v2 → ... → vπ

Using synchronous backups,

At each iteration k + 1

For all states s ∈ S

Update vk+1(s) from vk(s'), where s' is a successor state of s

We will discuss asynchronous backups later

Convergence to vπ will be proven at the end of the lecture

60 of 99

Lecture 3: Planning by Dynamic Programming

Iterative Policy Evaluation (2)

[Backup diagram: vk+1(s) at state s updated from vk(s') at its successor states s']

vk+1(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )

vk+1 = Rπ + γPπ vk
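
A minimal sketch of this backup in its matrix form, assuming the policy-induced MRP (Pπ, Rπ) is given as NumPy arrays:

```python
import numpy as np

def iterative_policy_evaluation(P_pi, R_pi, gamma, tol=1e-8):
    """Synchronous backups v_{k+1} = R_pi + gamma * P_pi @ v_k until convergence."""
    v = np.zeros(len(R_pi))
    while True:
        v_next = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
```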

61 of 99

Lecture 3: Planning by Dynamic Programming

Evaluating a Random Policy in the Small Gridworld

Undiscounted episodic MDP (γ = 1)

Nonterminal states 1, ..., 14

One terminal state (shown twice as shaded squares)

Actions leading out of the grid leave state unchanged

Reward is −1 until the terminal state is reached

Agent follows uniform random policy

π(n) = π(e) = π(s) = π(w) = 0.25

62 of 99

Lecture 3: Planning by Dynamic Programming

Iterative Policy Evaluation in Small Gridworld

[Figure: iterative policy evaluation in the small gridworld. Left column: vk for the random policy at k = 0 (all values 0.0), k = 1 (−1.0 at every nonterminal state), and k = 2 (values between −1.7 and −2.0). Right column: the greedy policy w.r.t. vk.]

63 of 99

Lecture 3: Planning by Dynamic Programming

Iterative Policy Evaluation in Small Gridworld (2)

[Figure continued: vk for the random policy at k = 3 (values between −2.4 and −3.0), k = 10 (values between −6.1 and −9.0), and k = ∞ (values between −14 and −22), with the greedy policy w.r.t. vk equal to the optimal policy.]

64 of 99

Lecture 3: Planning by Dynamic Programming

How to Improve a Policy

Given a policy π

Evaluate the policy π

vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]

Improve the policy by acting greedily with respect to vπ

π' = greedy(vπ)

In Small Gridworld improved policy was optimal, π' = π∗

In general, need more iterations of improvement / evaluation

But this process of policy iteration always converges to π∗

65 of 99

Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy evaluation: estimate vπ

Iterative policy evaluation

Policy improvement: generate π' ≥ π

Greedy policy improvement
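
A sketch of the full loop, assuming tabular NumPy arrays P[a, s, s2] (transition probabilities) and R[s, a] (expected rewards); this layout and the fixed number of evaluation sweeps are choices made for this sketch:

```python
import numpy as np

def policy_iteration(P, R, gamma, n_eval_sweeps=200):
    """Alternate iterative policy evaluation and greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: repeated Bellman expectation backups under pi
        v = np.zeros(n_states)
        for _ in range(n_eval_sweeps):
            v = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ v
                          for s in range(n_states)])
        # Policy improvement: act greedily with respect to q_pi(s, a)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v                         # policy is stable
        pi = pi_new
```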

66 of 99

Lecture 3: Planning by Dynamic Programming

Policy Improvement

Consider a deterministic policy, a = π(s)

We can improve the policy by acting greedily

π'(s) = argmax_{a∈A} qπ(s, a)

This improves the value from any state s over one step,

qπ(s, π'(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)

It therefore improves the value function, vπ'(s) ≥ vπ(s):

vπ(s) ≤ qπ(s, π'(s)) = Eπ' [Rt+1 + γvπ(St+1) | St = s]
      ≤ Eπ' [Rt+1 + γqπ(St+1, π'(St+1)) | St = s]
      ≤ Eπ' [Rt+1 + γRt+2 + γ²qπ(St+2, π'(St+2)) | St = s]
      ≤ Eπ' [Rt+1 + γRt+2 + ... | St = s] = vπ'(s)

67 of 99

Lecture 3: Planning by Dynamic Programming

Policy Improvement (2)

If improvements stop,

qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

Then the Bellman optimality equation has been satisfied

vπ(s) = max_{a∈A} qπ(s, a)

Therefore vπ(s) = v∗(s) for all s ∈ S

so π is an optimal policy

68 of 99

Lecture 3: Planning by Dynamic Programming

Modified Policy Iteration

Does policy evaluation need to converge to vπ ?

Or should we introduce a stopping condition

e.g. ε-convergence of value function

Or simply stop after k iterations of iterative policy evaluation?

For example, in the small gridworld k = 3 was sufficient to achieve optimal policy

Why not update policy every iteration? i.e. stop after k = 1

This is equivalent to value iteration (next section)

69 of 99

Lecture 3: Planning by Dynamic Programming

Generalised Policy Iteration

Policy evaluation: estimate vπ

Any policy evaluation algorithm

Policy improvement: generate π' ≥ π

Any policy improvement algorithm

70 of 99

Lecture 3: Planning by Dynamic Programming

Principle of Optimality

Any optimal policy can be subdivided into two components:

An optimal first action A∗

Followed by an optimal policy from successor state S'

Theorem (Principle of Optimality)

A policy π(a|s) achieves the optimal value from state s, vπ(s) = v∗(s), if and only if

For any state s' reachable from s

π achieves the optimal value from state s', vπ(s') = v∗(s')

71 of 99

Lecture 3: Planning by Dynamic Programming

Value Iteration

Problem: find optimal policy π∗

Solution: iterative application of Bellman optimality backup

v1 → v2 → ... → v∗

Using synchronous backups

At each iteration k + 1

For all states s ∈ S

Update vk+1(s) from vk(s')

Convergence to v∗ will be proven later

Unlike policy iteration, there is no explicit policy

Intermediate value functions may not correspond to any policy

72 of 99

Lecture 3: Planning by Dynamic Programming

Value Iteration (2)

[Backup diagram: vk+1(s) at state s updated from vk(s') via a max over actions]

vk+1(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )

vk+1 = max_{a∈A} ( R^a + γ P^a vk )
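
A sketch of the same backup as code, using the same assumed array layout as the policy-iteration sketch above (P[a, s, s2], R[s, a]):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s2 P(s2|s,a) v_k(s2) ]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_k(s, a)
        v_next = q.max(axis=1)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next, q.argmax(axis=1)            # approx. v*, greedy policy
        v = v_next
```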

73 of 99

Lecture 3: Planning by Dynamic Programming

Asynchronous Dynamic Programming

DP methods described so far used synchronous backups

i.e. all states are backed up in parallel

Asynchronous DP backs up states individually, in any order

For each selected state, apply the appropriate backup

Can significantly reduce computation

Guaranteed to converge if all states continue to be selected

74 of 99

Lecture 3: Planning by Dynamic Programming

Some Technical Questions

How do we know that value iteration converges to v∗?

Or that iterative policy evaluation converges to vπ?

And therefore that policy iteration converges to v∗?

Is the solution unique?

How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem

75 of 99

Lecture 4: Model-Free Prediction

Monte-Carlo Reinforcement Learning

MC methods learn directly from episodes of experience

MC is model-free: no knowledge of MDP transitions / rewards

MC learns from complete episodes: no bootstrapping

MC uses the simplest possible idea: value = mean return

Caveat: can only apply MC to episodic MDPs

All episodes must terminate

76 of 99

Lecture 4: Model-Free Prediction

Monte-Carlo Policy Evaluation

Goal: learn vπ from episodes of experience under policy π

S1, A1, R2, ..., Sk ∼ π

Recall that the return is the total discounted reward:

Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT

Recall that the value function is the expected return:

vπ (s) = Eπ [Gt | St = s]

Monte-Carlo policy evaluation uses empirical mean return instead of expected return

77 of 99

Lecture 4: Model-Free Prediction

First-Visit Monte-Carlo Policy Evaluation

To evaluate state s

The first time-step t that state s is visited in an episode,

Increment counter N(s) ← N(s) + 1

Increment total return S(s) ← S(s) + Gt

Value is estimated by mean return V(s) = S(s)/N(s)

By law of large numbers, V(s) → vπ(s) as N(s) → ∞
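
A minimal sketch of first-visit MC evaluation, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state (a simplified format chosen for this sketch):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the mean of first-visit returns."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    for episode in episodes:
        # Compute returns G_t by sweeping backwards through the episode
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:             # count only the first visit
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}
```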

78 of 99

Lecture 4: Model-Free Prediction

Every-Visit Monte-Carlo Policy Evaluation

To evaluate state s

Every time-step t that state s is visited in an episode,

Increment counter N(s) ← N(s) + 1

Increment total return S(s) ← S(s) + Gt

Value is estimated by mean return V(s) = S(s)/N(s)

Again, V(s) → vπ(s) as N(s) → ∞

79 of 99

Lecture 4: Model-Free Prediction

Incremental Mean

The mean µ1, µ2, ... of a sequence x1, x2, ... can be computed incrementally,

µk = (1/k) Σ_{j=1}^k xj

   = (1/k) ( xk + (k − 1) µk−1 )

   = µk−1 + (1/k) ( xk − µk−1 )
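
The same identity as a couple of lines of code (a trivial sketch, included only to mirror the formula):

```python
def incremental_mean(xs):
    """mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k, one sample at a time."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
    return mu

assert abs(incremental_mean([2.0, 4.0, 6.0]) - 4.0) < 1e-12
```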

80 of 99

Lecture 4: Model-Free Prediction

Incremental Monte-Carlo Updates

Update V (s) incrementally after episode S1, A1, R2, ..., ST

For each state St with return Gt

N(St) ← N(St) + 1

V(St) ← V(St) + (1/N(St)) (Gt − V(St))

In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes.

V(St) ← V(St) + α (Gt − V(St))

81 of 99

Lecture 4: Model-Free Prediction

Temporal-Difference Learning

TD methods learn directly from episodes of experience

TD is model-free: no knowledge of MDP transitions / rewards

TD learns from incomplete episodes, by bootstrapping

TD updates a guess towards a guess

82 of 99

Lecture 4: Model-Free Prediction

MC and TD

Goal: learn vπ online from experience under policy π

Incremental every-visit Monte-Carlo

Update value V (St ) toward actual return Gt

V(St) ← V(St) + α (Gt − V(St))

Simplest temporal-difference learning algorithm: TD(0)

Update value V(St) toward estimated return Rt+1 + γV(St+1)

V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))

Rt+1 + γV(St+1) is called the TD target

δt = Rt+1 + γV(St+1) − V(St) is called the TD error
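
A sketch of a single TD(0) backup on a tabular value function (the dict-based table and argument names are choices made for this sketch):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)); returns the TD error."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]          # delta_t
    V[s] += alpha * td_error
    return td_error

V = defaultdict(float)
td0_update(V, s="A", r=1.0, s_next="B", alpha=0.1, gamma=0.9)
```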

83 of 99

Lecture 4: Model-Free Prediction

Driving Home Example

State                 Elapsed Time    Predicted     Predicted
                      (minutes)       Time to Go    Total Time
leaving office        0               30            30
reach car, raining    5               35            40
exit highway          20              15            35
behind truck          30              10            40
home street           40              3             43
arrive home           43              0             43

84 of 99

Lecture 4: Model-Free Prediction

Driving Home Example: MC vs. TD

[Figure: changes recommended by Monte-Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

85 of 99

Lecture 4: Model-Free Prediction

Advantages and Disadvantages of MC vs. TD

TD can learn before knowing the final outcome

TD can learn online after every step

MC must wait until end of episode before return is known

TD can learn without the final outcome

TD can learn from incomplete sequences

MC can only learn from complete sequences

TD works in continuing (non-terminating) environments

MC only works for episodic (terminating) environments

86 of 99

Lecture 4: Model-Free Prediction

Bias/Variance Trade-Off

Return Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT is an unbiased estimate of vπ(St)

True TD target Rt+1 + γvπ(St+1) is an unbiased estimate of vπ(St)

TD target Rt+1 + γV(St+1) is a biased estimate of vπ(St)

TD target is much lower variance than the return:

Return depends on many random actions, transitions, rewards

TD target depends on one random action, transition, reward

87 of 99

Lecture 4: Model-Free Prediction

Advantages and Disadvantages of MC vs. TD (2)

MC has high variance, zero bias

Good convergence properties

(even with function approximation)

Not very sensitive to initial value

Very simple to understand and use

TD has low variance, some bias

Usually more efficient than MC

TD(0) converges to vπ(s)

(but not always with function approximation)

More sensitive to initial value

88 of 99

Lecture 4: Model-Free Prediction

Batch MC and TD

MC and TD converge: V(s) → vπ(s) as experience → ∞

But what about batch solution for finite experience?

s^1_1, a^1_1, r^1_2, ..., s^1_{T1}
...
s^K_1, a^K_1, r^K_2, ..., s^K_{TK}

e.g. Repeatedly sample episode k ∈ [1, K]

Apply MC or TD(0) to episode k

89 of 99

Lecture 4: Model-Free Prediction

AB Example

Two states A, B; no discounting; 8 episodes of experience

A, 0, B, 0

B, 1

B, 1

B, 1

B, 1

B, 1

B, 1

B, 0

What is V (A), V (B)?
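
A small sketch of the batch answers, assuming no discounting and empirical mean returns; MC and TD(0) disagree on V(A) because only TD exploits the observed A → B transition:

```python
# The 8 episodes as (state, reward) pairs, gamma = 1
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte-Carlo: empirical mean return from each visited state
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0
    for state, reward in reversed(ep):
        g += reward
        returns[state].append(g)
mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(mc)   # {'A': 0.0, 'B': 0.75}

# Batch TD(0) converges to the values of the maximum-likelihood Markov model:
# A always transitions to B with reward 0, so V(A) = 0 + V(B)
td = {"A": 0 + mc["B"], "B": mc["B"]}
print(td)   # {'A': 0.75, 'B': 0.75}
```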

91 of 99

Lecture 4: Model-Free Prediction

Advantages and Disadvantages of MC vs. TD (3)

TD exploits Markov property

Usually more efficient in Markov environments

MC does not exploit Markov property

Usually more effective in non-Markov environments

92 of 99

Lecture 4: Model-Free Prediction

Monte-Carlo Backup

[Backup diagram: Monte-Carlo backs up V(St) along one complete sampled trajectory from St to a terminal state]

V(St) ← V(St) + α (Gt − V(St))

93 of 99

Lecture 4: Model-Free Prediction

Temporal-Difference Backup

[Backup diagram: TD(0) backs up V(St) from a single sampled step (Rt+1, St+1)]

V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))

94 of 99

Lecture 4: Model-Free Prediction

Dynamic Programming Backup

[Backup diagram: dynamic programming backs up V(St) using a full one-step expectation over all actions and successor states]

V(St) ← Eπ[Rt+1 + γV(St+1)]

95 of 99

Lecture 4: Model-Free Prediction

Bootstrapping and Sampling

Bootstrapping: update involves an estimate

MC does not bootstrap

DP bootstraps

TD bootstraps

Sampling: update samples an expectation

MC samples

DP does not sample

TD samples

96 of 99

Lecture 4: Model-Free Prediction

Unified View of Reinforcement Learning

97 of 99

Lecture 4: Model-Free Prediction

n-Step Prediction

Let TD target look n steps into the future

98 of 99

Lecture 4: Model-Free Prediction

n-Step Return

Consider the following n-step returns for n = 1, 2, ..., ∞:

n = 1 (TD)    G_t^(1) = Rt+1 + γV(St+1)

n = 2         G_t^(2) = Rt+1 + γRt+2 + γ²V(St+2)

...

n = ∞ (MC)    G_t^(∞) = Rt+1 + γRt+2 + ... + γ^(T−1) RT

Define the n-step return

G_t^(n) = Rt+1 + γRt+2 + ... + γ^(n−1) Rt+n + γ^n V(St+n)

n-step temporal-difference learning

V(St) ← V(St) + α ( G_t^(n) − V(St) )
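
A sketch of the n-step target and update, assuming a tabular value function and that the n rewards and the state n steps ahead have already been collected (argument names are illustrative):

```python
def n_step_return(rewards, V, s_n, gamma, terminal=False):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n})."""
    g = 0.0
    for k, r in enumerate(rewards):      # rewards = [R_{t+1}, ..., R_{t+n}]
        g += (gamma ** k) * r
    if not terminal:
        g += (gamma ** len(rewards)) * V[s_n]
    return g

def n_step_td_update(V, s_t, g_n, alpha):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    V[s_t] += alpha * (g_n - V[s_t])
```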

99 of 99

Lecture 4: Model-Free Prediction

Large Random Walk Example