Many Faces of Reinforcement Learning
Reinforcement learning sits at the intersection of many fields, each with its own name for the same underlying problem:
Computer Science: machine learning
Engineering: optimal control
Mathematics: operations research
Economics: bounded rationality
Neuroscience: reward system
Psychology: classical/operant conditioning
Branches of Machine Learning
Machine learning has three main branches: supervised learning, unsupervised learning, and reinforcement learning.
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential)
Agent’s actions affect the subsequent data it receives
Examples of Reinforcement Learning
Fly stunt manoeuvres in a helicopter
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
Rewards
A reward Rt is a scalar feedback signal
Indicates how well the agent is doing at step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
−ve reward for crashing
Defeat the world champion at Backgammon
+/−ve reward for winning/losing a game
Manage an investment portfolio
+ve reward for each $ in bank
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
Play many different Atari games better than humans
+/−ve reward for increasing/decreasing score
Sequential Decision Making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)
Agent and Environment
[Figure: the agent–environment interaction loop — the agent receives observation Ot and reward Rt, and emits action At to the environment]
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1
t increments at env. step
History and State
The history is the sequence of observations, actions, rewards
Ht = O1, R1, A1, ..., At−1, Ot, Rt
i.e. all observable variables up to time t
i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
St = f (Ht )
Environment State
[Figure: the agent–environment loop, with the environment state S^e_t labelled inside the environment]
The environment state S^e_t is the environment's private representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment state is not usually visible to the agent
Even if S^e_t is visible, it may contain irrelevant information
Agent State
[Figure: the agent–environment loop, with the agent state S^a_t labelled inside the agent]
The agent state S^a_t is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be any function of history:
S^a_t = f(Ht)
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St ] = P[St+1 | S1, ..., St ]
“The future is independent of the past given the present”
H1:t → St → Ht+1:∞
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state S^e_t is Markov
The history Ht is Markov
Fully Observable Environments
[Figure: the agent–environment loop in which the agent directly observes the state St]
Full observability: the agent directly observes the environment state
Ot = S^a_t = S^e_t
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
A robot with camera vision isn't told its absolute location
A trading agent only observes current prices
A poker playing agent only observes public cards
Now agent state ≠ environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation S^a_t, e.g.
Complete history: S^a_t = Ht
Beliefs of environment state: S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n])
Recurrent neural network: S^a_t = σ(S^a_{t−1} Ws + Ot Wo)
Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent's representation of the environment
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
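As a small illustration (not from the lecture), here is a Python sketch of the two policy types for a made-up two-state, two-action problem; the state and action names are arbitrary.

```python
# A minimal sketch: deterministic vs. stochastic policies for a toy problem.
import numpy as np

# Deterministic policy: a = pi(s), here just a lookup table.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
stochastic_pi = {"s0": {"left": 0.8, "right": 0.2},
                 "s1": {"left": 0.5, "right": 0.5}}

def sample_action(pi, s, rng=np.random.default_rng(0)):
    """Sample an action from a stochastic policy pi(.|s)."""
    probs = pi[s]
    return rng.choice(list(probs.keys()), p=list(probs.values()))

print(deterministic_pi["s0"])              # -> "left"
print(sample_action(stochastic_pi, "s1"))  # random draw from pi(.|s1)
```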
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
Model
A model predicts what the environment will do next
P predicts the next state
R predicts the next (immediate) reward, e.g.
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
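As a rough sketch of this idea (my own illustration, with made-up numbers), a tabular model can be stored as a transition tensor P[s, a, s'] and a reward table R[s, a]:

```python
# A minimal sketch: a tabular model with P[s, a, s'] and R[s, a].
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical model: P[s, a, :] is a distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalise so each P[s, a, :] sums to 1
R = rng.random((n_states, n_actions))    # R[s, a] = E[R_{t+1} | S_t = s, A_t = a]

def model_step(s, a):
    """Sample a next state from P and return it with the expected reward."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

print(model_step(0, 1))
```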
Categorizing RL agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
RL Agent Taxonomy
[Figure: Venn diagram relating value function, policy and model — their overlaps give value-based, policy-based, actor-critic, model-free and model-based agents]
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Exploration and Exploitation (1)
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment Without losing too much reward along the way
Exploration and Exploitation (2)
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
Examples
Restaurant Selection
Exploitation: go to your favourite restaurant
Exploration: try a new restaurant
Online Banner Advertisements
Exploitation: show the most successful advert
Exploration: show a different advert
Oil Drilling
Exploitation: drill at the best known location
Exploration: drill at a new location
Game Playing
Exploitation: play the move you believe is best
Exploration: play an experimental move
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning
Where the environment is fully observable
i.e. The current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
P [St+1 | St ] = P [St+1 | S1, ..., St ]
The state captures all relevant information from the history Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is defined by
Pss' = P[St+1 = s' | St = s]
The state transition matrix P defines transition probabilities from all states s (rows) to all successor states s' (columns),
P = [ P11 ... P1n
       ⋮        ⋮
      Pn1 ... Pnn ]
where each row of the matrix sums to 1.
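A minimal Python sketch (with a hypothetical 3-state chain) of what this looks like in code; the asserted row sums correspond to the condition above:

```python
# A minimal sketch: a state transition matrix P, rows indexed by s, columns by s'.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a distribution over s'

def simulate(P, s0, n_steps, rng=np.random.default_rng(0)):
    """Sample a trajectory S_1, S_2, ... from the Markov chain."""
    s, path = s0, [s0]
    for _ in range(n_steps):
        s = rng.choice(len(P), p=P[s])
        path.append(s)
    return path

print(simulate(P, s0=0, n_steps=10))
```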
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple (S, P)
S is a (finite) set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
Markov Reward Process
A Markov reward process is a Markov chain with values.
Definition
A Markov Reward Process is a tuple (S, P, R, γ)
S is a finite set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
R is a reward function, Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
Return
Definition
The return Gt is the total discounted reward from time-step t.
Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^∞ γ^k Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γ^k R
This values immediate reward above delayed reward
γ close to 0 leads to "myopic" evaluation
γ close to 1 leads to "far-sighted" evaluation
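As a small worked example (not from the lecture), the return can be accumulated backwards through a reward sequence:

```python
# A minimal sketch: computing G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite episode.
def discounted_return(rewards, gamma):
    """rewards[k] is R_{t+k+1}; returns G_t = sum_k gamma**k * rewards[k]."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: G = r + gamma * G
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, gamma=0.0))   # myopic: only R_{t+1} counts
```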
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
Value Function
The value function v (s) gives the long-term value of state s
Definition
The state value function v (s) of an MRP is the expected return starting from state s
v (s) = E [Gt | St = s]
Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward Rt+1
discounted value of successor state γv(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]
Bellman Equation for MRPs (2)
v (s) = E [Rt+1 + γv (St+1) | St = s]
[Backup diagram: state s and its successor states s']
v(s) = Rs + γ Σ_{s'∈S} Pss' v(s')
Solving the Bellman Equation
The Bellman equation is a linear equation
It can be solved directly:
v = R + γPv
(I − γP) v = R
v = (I − γP)−1 R
Computational complexity is O(n³) for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
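A short numpy sketch of the direct solution above, using a hypothetical 3-state MRP; the numbers are made up purely for illustration:

```python
# A minimal sketch: solving the Bellman equation directly, v = (I - gamma*P)^(-1) R.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])      # transition matrix (hypothetical)
R = np.array([-1.0, -2.0, 0.0])      # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # O(n^3) direct solution
print(v)

# Sanity check: v should satisfy v = R + gamma * P v
assert np.allclose(v, R + gamma * P @ v)
```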
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple (S, A, P, R, γ)
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix,
P^a_ss' = P[St+1 = s' | St = s, At = a]
R is a reward function, R^a_s = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
Policies (1)
Definition
A policy π is a distribution over actions given states,
π(a|s) = P [At = a | St = s]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. Policies are stationary (time-independent),
At ∼ π(·|St ), ∀ t > 0
Value Function
Definition
The state-value function vπ (s) of an MDP is the expected return starting from state s, and then following policy π
vπ (s) = Eπ [Gt | St = s]
Definition
The action-value function qπ (s, a) is the expected return
starting from state s, taking action a, and then following policy π
qπ (s, a) = Eπ [Gt | St = s, At = a]
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ (s, a) = Eπ [Rt+1 + γqπ (St+1, At+1) | St = s, At = a]
Bellman Expectation Equation for V^π
[Backup diagram: state s, with one branch per action a]
vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a)
Bellman Expectation Equation for Q^π
[Backup diagram: state–action pair (s, a), with one branch per successor state s']
qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s')
Bellman Expectation Equation for vπ (2)
[Backup diagram: state s, actions a, successor states s']
vπ(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vπ(s') )
Bellman Expectation Equation for qπ (2)
[Backup diagram: state–action pair (s, a), successor states s' and actions a']
qπ(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' Σ_{a'∈A} π(a'|s') qπ(s', a')
Bellman Expectation Equation (Matrix Form)
The Bellman expectation equation can be expressed concisely using the induced MRP,
vπ = Rπ + γPπvπ
with direct solution
vπ = (I − γPπ )−1 Rπ
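As an illustration of the induced MRP (my own sketch, with a hypothetical 2-state, 2-action MDP and policy), Pπ and Rπ can be formed by averaging over the policy and then solved directly:

```python
# A minimal sketch: form (P_pi, R_pi) for a fixed policy and solve for v_pi.
import numpy as np

n_s, gamma = 2, 0.9
# Hypothetical tabular MDP: P[s, a, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],     # pi[s, a] = pi(a|s)
               [0.2, 0.8]])

# Induced MRP: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'];  R_pi[s] = sum_a pi(a|s) R[s, a]
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(v_pi)
```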
Optimal Value Function
Definition
The optimal state-value function v∗(s) is the maximum value function over all policies
v∗(s) = max_π vπ(s)
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies
q∗(s, a) = max_π qπ(s, a)
The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value fn.
Optimal Policy
Define a partial ordering over policies
π ≥ π' if vπ(s) ≥ vπ'(s), ∀s
Theorem
For any Markov Decision Process
There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, vπ∗ (s) = v∗(s)
All optimal policies achieve the optimal action-value function, qπ∗ (s, a) = q∗(s, a)
Finding an Optimal Policy
An optimal policy can be found by maximising over q∗(s, a),
π∗(a|s) = 1  if a = argmax_{a∈A} q∗(s, a)
          0  otherwise
There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy
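A tiny sketch (with a made-up q∗ table) of reading off the greedy, deterministic optimal policy:

```python
# A minimal sketch: acting greedily with respect to a given q*(s, a) table.
import numpy as np

q_star = np.array([[1.0, 3.0],    # hypothetical q*(s, a), 3 states x 2 actions
                   [2.5, 2.0],
                   [0.0, 0.0]])

pi_star = np.argmax(q_star, axis=1)    # pi*(s) = argmax_a q*(s, a)
print(pi_star)                         # -> [1 0 0]
```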
Bellman Optimality Equation for v∗
The optimal value functions are recursively related by the Bellman optimality equations:
[Backup diagram: state s, with one branch per action a]
v∗(s) = max_a q∗(s, a)
Bellman Optimality Equation for Q∗
[Backup diagram: state–action pair (s, a), with one branch per successor state s']
q∗(s, a) = R^a_s + γ Σ_{s'∈S} P^a_ss' v∗(s')
Solving the Bellman Optimality Equation
Bellman Optimality Equation is non-linear
No closed form solution (in general)
Many iterative solution methods
Value Iteration
Policy Iteration
Q-learning
Sarsa
Lecture 3: Planning by Dynamic Programming
What is Dynamic Programming?
Dynamic: sequential or temporal component to the problem
Programming: optimising a "program", i.e. a policy
c.f. linear programming
A method for solving complex problems
By breaking them down into subproblems
Solve the subproblems
Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
Optimal substructure
Principle of optimality applies
Optimal solution can be decomposed into subproblems
Overlapping subproblems
Subproblems recur many times
Solutions can be cached and reused
Markov decision processes satisfy both properties
Bellman equation gives recursive decomposition
Value function stores and reuses solutions
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP
For prediction:
Input: MDP (S, A, P, R, γ) and policy π
or: MRP (S, Pπ, Rπ, γ)
Output: value function vπ
Or for control:
Input: MDP (S, A, P, R, γ)
Output: optimal value function v∗
and: optimal policy π∗
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
Scheduling algorithms
String algorithms (e.g. sequence alignment)
Graph algorithms (e.g. shortest path algorithms)
Graphical models (e.g. Viterbi algorithm)
Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
Problem: evaluate a given policy π
Solution: iterative application of Bellman expectation backup
v1 → v2 → ... → vπ
Using synchronous backups,
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s'), where s' is a successor state of s
We will discuss asynchronous backups later
Convergence to vπ will be proven at the end of the lecture
Iterative Policy Evaluation (2)
[Backup diagram: vk+1(s) at state s is computed from vk(s') at successor states s']
vk+1(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )
vk+1 = Rπ + γ Pπ vk
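The backup above translates almost line-for-line into code. The following is my own tabular sketch (not the course's implementation), assuming the MDP is given as arrays P[s, a, s'] and R[s, a]:

```python
# A minimal sketch: synchronous iterative policy evaluation, v_{k+1} = R_pi + gamma * P_pi v_k.
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """P[s,a,s'], R[s,a], pi[s,a]; returns v_pi for the given policy."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # induced MRP transition matrix
    R_pi = np.einsum('sa,sa->s', pi, R)     # induced MRP reward vector
    v = np.zeros(P.shape[0])
    while True:
        v_new = R_pi + gamma * P_pi @ v     # one synchronous Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Example with a hypothetical 2-state, 2-action MDP and a uniform random policy:
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(iterative_policy_evaluation(P, R, pi, gamma=0.9))
```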
Evaluating a Random Policy in the Small Gridworld
Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice as shaded squares)
Actions leading out of the grid leave state unchanged
Reward is −1 until the terminal state is reached
Agent follows uniform random policy
π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
Iterative Policy Evaluation in Small Gridworld
[Figure: vk for the random policy (left) and the greedy policy w.r.t. vk (right); the policy arrows are not reproduced here]

k = 0:
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0
 0.0 |  0.0 |  0.0 |  0.0

k = 1:
 0.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 | -1.0
-1.0 | -1.0 | -1.0 |  0.0

k = 2:
 0.0 | -1.7 | -2.0 | -2.0
-1.7 | -2.0 | -2.0 | -2.0
-2.0 | -2.0 | -2.0 | -1.7
-2.0 | -2.0 | -1.7 |  0.0
Iterative Policy Evaluation in Small Gridworld (2)
k = 3:
 0.0 | -2.4 | -2.9 | -3.0
-2.4 | -2.9 | -3.0 | -2.9
-2.9 | -3.0 | -2.9 | -2.4
-3.0 | -2.9 | -2.4 |  0.0

k = 10:
 0.0 | -6.1 | -8.4 | -9.0
-6.1 | -7.7 | -8.4 | -8.4
-8.4 | -8.4 | -7.7 | -6.1
-9.0 | -8.4 | -6.1 |  0.0

k = ∞:
 0.0 | -14. | -20. | -22.
-14. | -18. | -20. | -20.
-20. | -20. | -18. | -14.
-22. | -20. | -14. |  0.0

[From k = 3 onwards the greedy policy w.r.t. vk is the optimal policy]
How to Improve a Policy
Given a policy π
Evaluate the policy π
vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
Improve the policy by acting greedily with respect to vπ
π' = greedy(vπ)
In Small Gridworld the improved policy was optimal, π' = π∗
In general, more iterations of improvement / evaluation are needed
But this process of policy iteration always converges to π∗
Policy Iteration
Policy evaluation: estimate vπ
Iterative policy evaluation
Policy improvement: generate π' ≥ π
Greedy policy improvement
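A compact sketch of the loop described above (my own tabular version, assuming exact policy evaluation by matrix inversion and arrays P[s, a, s'], R[s, a]):

```python
# A minimal sketch: policy iteration = exact evaluation + greedy improvement.
import numpy as np

def policy_iteration(P, R, gamma):
    """P[s,a,s'], R[s,a]; returns (optimal deterministic policy, v*)."""
    n_s, n_a = R.shape
    policy = np.zeros(n_s, dtype=int)            # start from an arbitrary policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly.
        P_pi = P[np.arange(n_s), policy]         # P_pi[s, s'] = P[s, pi(s), s']
        R_pi = R[np.arange(n_s), policy]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. q_pi(s, a).
        q = R + gamma * np.einsum('sat,t->sa', P, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):   # policy stable => optimal
            return policy, v
        policy = new_policy
```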
Policy Improvement
Consider a deterministic policy, a = π(s)
We can improve the policy by acting greedily
π'(s) = argmax_{a∈A} qπ(s, a)
This improves the value from any state s over one step,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)
It therefore improves the value function, vπ'(s) ≥ vπ(s)
vπ(s) ≤ qπ(s, π'(s)) = Eπ'[Rt+1 + γvπ(St+1) | St = s]
      ≤ Eπ'[Rt+1 + γqπ(St+1, π'(St+1)) | St = s]
      ≤ Eπ'[Rt+1 + γRt+2 + γ²qπ(St+2, π'(St+2)) | St = s]
      ≤ Eπ'[Rt+1 + γRt+2 + ... | St = s] = vπ'(s)
Policy Improvement (2)
If improvements stop,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
Then the Bellman optimality equation has been satisfied
vπ(s) = max_{a∈A} qπ(s, a)
Therefore vπ (s) = v∗(s) for all s ∈ S
so π is an optimal policy
Modified Policy Iteration
Does policy evaluation need to converge to vπ ?
Or should we introduce a stopping condition
e.g. ε-convergence of value function
Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
Why not update policy every iteration? i.e. stop after k = 1
This is equivalent to value iteration (next section)
Generalised Policy Iteration
Policy evaluation: estimate vπ
Any policy evaluation algorithm
Policy improvement: generate π' ≥ π
Any policy improvement algorithm
Principle of Optimality
Any optimal policy can be subdivided into two components:
An optimal first action A∗
Followed by an optimal policy from successor state S'
Theorem (Principle of Optimality)
A policy π(a|s) achieves the optimal value from state s, vπ(s) = v∗(s), if and only if
For any state s' reachable from s
π achieves the optimal value from state s', vπ(s') = v∗(s')
Value Iteration
Problem: find optimal policy π
Solution: iterative application of Bellman optimality backup
v1 → v2 → ... → v∗
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk (s')
Convergence to v∗ will be proven later
Unlike policy iteration, there is no explicit policy
Intermediate value functions may not correspond to any policy
Value Iteration (2)
[Backup diagram: vk+1(s) at state s is computed from vk(s') at successor states s']
vk+1(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_ss' vk(s') )
vk+1 = max_{a∈A} ( R^a + γ P^a vk )
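The Bellman optimality backup above, as a short tabular sketch (my own code, assuming arrays P[s, a, s'] and R[s, a]):

```python
# A minimal sketch: value iteration, v_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') v_k(s')].
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P[s,a,s'], R[s,a]; returns (v*, greedy policy w.r.t. v*)."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum('sat,t->sa', P, v)   # one-step lookahead
        v_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```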
Asynchronous Dynamic Programming
DP methods described so far used synchronous backups
i.e. all states are backed up in parallel
Asynchronous DP backs up states individually, in any order
For each selected state, apply the appropriate backup
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected
Some Technical Questions
How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
And therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem
Lecture 4: Model-Free Prediction
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs
All episodes must terminate
Monte-Carlo Policy Evaluation
Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., Sk ∼ π
Recall that the return is the total discounted reward:
Gt = Rt+1 + γRt+2 + ... + γT−1RT
Recall that the value function is the expected return:
vπ (s) = Eπ [Gt | St = s]
Monte-Carlo policy evaluation uses empirical mean return instead of expected return
First-Visit Monte-Carlo Policy Evaluation
To evaluate state s
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
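A minimal sketch of first-visit Monte-Carlo evaluation (my own implementation, assuming episodes are given as (state, reward) pairs, where the reward is the one received on leaving that state):

```python
# A minimal sketch: first-visit Monte-Carlo policy evaluation from a batch of episodes.
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """episodes: list of [(state, reward), ...]; returns the estimated V(s)."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    for episode in episodes:
        # Compute the return G_t following each time-step, working backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit to this state only
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}

# Hypothetical two-episode batch:
episodes = [[("s1", 0), ("s2", 0)], [("s2", 1)]]
print(first_visit_mc(episodes, gamma=1.0))   # {'s1': 0.0, 's2': 0.5}
```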
Every-Visit Monte-Carlo Policy Evaluation
To evaluate state s
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
Again, V(s) → vπ(s) as N(s) → ∞
Incremental Mean
The mean µ1, µ2, ... of a sequence x1, x2, ... can be computed incrementally,
µk = (1/k) Σ_{j=1}^k xj
   = (1/k) (xk + (k − 1) µk−1)
   = µk−1 + (1/k) (xk − µk−1)
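A quick numerical check (my own example) that the incremental update reproduces the batch mean:

```python
# A minimal sketch: incremental mean update mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k.
import numpy as np

xs = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k        # incremental mean update
assert np.isclose(mu, xs.mean())
print(mu)                      # 2.8
```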
Incremental Monte-Carlo Updates
Update V (s) incrementally after episode S1, A1, R2, ..., ST
For each state St with return Gt
N(St) ← N(St) + 1
V(St) ← V(St) + (1/N(St)) (Gt − V(St))
In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes.
V (St ) ← V (St ) + α (Gt − V (St ))
Temporal-Difference Learning
TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
MC and TD
Goal: learn vπ online from experience under policy π
Incremental every-visit Monte-Carlo
Update value V (St ) toward actual return Gt
V (St ) ← V (St ) + α (Gt − V (St ))
Simplest temporal-difference learning algorithm: TD(0)
Update value V (St ) toward estimated return Rt+1 + γV (St+1)
V (St ) ← V (St ) + α (Rt+1 + γV (St+1) − V (St ))
Rt+1 + γV (St+1) is called the TD target
δt = Rt+1 + γV (St+1) − V (St ) is called the TD error
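A minimal sketch of the TD(0) update applied to a stream of (s, r, s') transitions (my own code; the states and rewards are made up):

```python
# A minimal sketch: one TD(0) step, V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)).
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) backup in place and return the TD error delta_t."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = defaultdict(float)
# Hypothetical transitions (s, r, s'); the value of a terminal state stays 0.
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "terminal")]:
    td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0)
print(dict(V))
```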
Driving Home Example
State              | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office     |  0                     | 30                   | 30
reach car, raining |  5                     | 35                   | 40
exit highway       | 20                     | 15                   | 35
behind truck       | 30                     | 10                   | 40
home street        | 40                     |  3                   | 43
arrive home        | 43                     |  0                   | 43
Driving Home Example: MC vs. TD
[Figure: changes recommended by Monte-Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
TD can learn online after every step
MC must wait until end of episode before return is known
TD can learn without the final outcome
TD can learn from incomplete sequences
MC can only learn from complete sequences
TD works in continuing (non-terminating) environments
MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
Return Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT is an unbiased estimate of vπ(St)
True TD target Rt+1 + γvπ(St+1) is an unbiased estimate of vπ(St)
TD target Rt+1 + γV(St+1) is a biased estimate of vπ(St)
TD target is much lower variance than the return:
Return depends on many random actions, transitions, rewards
TD target depends on one random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
Batch MC and TD
MC and TD converge: V (s) → vπ (s) as experience → ∞
But what about batch solution for finite experience?
s^1_1, a^1_1, r^1_2, ..., s^1_{T1}
⋮
s^K_1, a^K_1, r^K_2, ..., s^K_{TK}
e.g. Repeatedly sample episode k ∈ [1, K]
Apply MC or TD(0) to episode k
AB Example
Two states A, B; no discounting; 8 episodes of experience
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V (A), V (B)?
Advantages and Disadvantages of MC vs. TD (3)
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
Monte-Carlo Backup
[Backup diagram: MC backs up V(St) along the entire sampled trajectory until termination]
V(St) ← V(St) + α (Gt − V(St))
Temporal-Difference Backup
[Backup diagram: TD backs up V(St) from a single sampled step to St+1 with reward Rt+1]
V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
Dynamic Programming Backup
[Backup diagram: DP backs up V(St) from a full one-step lookahead over all successor states St+1]
V(St) ← Eπ[Rt+1 + γV(St+1)]
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample
TD samples
Unified View of Reinforcement Learning
n-Step Prediction
Let TD target look n steps into the future
n-Step Return
Consider the following n-step returns for n = 1, 2, ∞:
n = 1 (TD)   Gt^(1) = Rt+1 + γV(St+1)
n = 2        Gt^(2) = Rt+1 + γRt+2 + γ²V(St+2)
⋮
n = ∞ (MC)   Gt^(∞) = Rt+1 + γRt+2 + ... + γ^{T−1}RT
Define the n-step return
Gt^(n) = Rt+1 + γRt+2 + ... + γ^{n−1}Rt+n + γ^n V(St+n)
n-step temporal-difference learning
V(St) ← V(St) + α ( Gt^(n) − V(St) )
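A small sketch of computing Gt^(n) for a finite episode (my own code; it bootstraps from a stored value estimate, and the rewards/values are made up):

```python
# A minimal sketch: the n-step return G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).
def n_step_return(rewards, values, t, n, gamma):
    """rewards[k] is R_{k+1}; values[k] is the current estimate V(S_k).
    If the episode ends before step t+n, no bootstrap term is added."""
    T = len(rewards)
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]
        discount *= gamma
    if t + n < len(values):                 # bootstrap from V(S_{t+n})
        g += discount * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]                   # R_1, R_2, R_3 (hypothetical)
values = [0.5, 0.4, 0.3, 0.0]               # V(S_0..S_3), terminal value 0
print(n_step_return(rewards, values, t=0, n=1, gamma=0.9))   # TD(0) target
print(n_step_return(rewards, values, t=0, n=3, gamma=0.9))   # here equals the MC return
```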