
Lecture 3 - Markov Decision Processes: Part 1 - MDP Components & Bellman Equations


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 542-Reinforcement Learning


List of contents for this lecture

  • Markov processes

  • Markov reward processes

  • Markov decision processes

  • Bellman equations for value functions


Relevant readings for this lecture

  • Chapter 3 of “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, Cambridge, MA, 2018


Introduction

We will develop a mathematical formulation of the agent-environment interaction through a modelling framework known as the Markov Decision Process (MDP).

Fig.: Agent-environment interaction loop (the agent sends an action to the environment and receives an observation back).


Recap: episodic versus continuous tasks

  • Episodic tasks are tasks that have a beginning (an initial state) and an end (a terminal state). Such tasks take finite time, and an episode contains all states that occur between the initial state and the terminal state.

  • For example, in a car racing video game, you start the game (initial state) and play the game until it is over (final state). This is called an episode.

  • Once the game is over, you start the next episode by restarting the game, and you begin from the initial state irrespective of where you finished in the previous game. So, each episode is independent of the others.

  • In a continuous task, there is no terminal state. Continuous tasks will never end. For example, a personal assistance robot does not have a terminal state.


Three processes: from simple to complex (1)

Markov processes (MPs) → Markov reward processes (MRPs) → Markov decision processes (MDPs), ordered from simple to complex.


Three processes: from simple to complex (2)

Markov processes (MPs) → Markov reward processes (MRPs) → Markov decision processes (MDPs), ordered from simple to complex.

All three of the above processes share a very important property, known as the Markov property.

Definition (Markov property)
A state S_t is Markov if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t].

  • The current state captures all relevant information from the history.
  • Once the current state is known, the history can be thrown away.

(i.e., the current state is a sufficient statistic of the history)
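In code, the Markov property corresponds to a state-transition sampler whose only input is the current state. A minimal sketch, assuming a hypothetical 3-state chain (not one from this lecture) and NumPy:

    import numpy as np

    # Hypothetical 3-state Markov process; row s holds the distribution of the next state.
    P = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])

    rng = np.random.default_rng(0)

    def step(s):
        # The next state depends only on the current state s, not on the path taken to
        # reach it; this is exactly the Markov property.
        return rng.choice(len(P), p=P[s])

    s = 0
    for _ in range(5):
        s = step(s)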


A remark on scalar and vectorial representations in finite MDPs

 

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Markov Processes (MPs)

Definition
A Markov process (or Markov chain) is a tuple ⟨S, P⟩, where
  • S is a (finite) set of states, and
  • P is a state transition probability matrix with entries P_{ss'} = P[S_{t+1} = s' | S_t = s].


Chapman-Kolmogorov equation for MPs (1)

Definition (C-K equation for finite MPs)
For a finite MP with transition matrix P, the k-step transition probabilities P_{ss'}^(k) = P[S_{t+k} = s' | S_t = s] satisfy

P_{ss'}^(m+n) = Σ_{s''∈S} P_{ss''}^(m) P_{s''s'}^(n),

i.e., in matrix form P^(m+n) = P^(m) P^(n), so the n-step transition matrix is the n-th matrix power, P^(n) = P^n.
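A minimal numerical sketch of the Chapman-Kolmogorov relation, using the same kind of hypothetical transition matrix as above: composing two 2-step transitions gives the 4-step transitions.

    import numpy as np

    P = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])

    P2 = np.linalg.matrix_power(P, 2)   # 2-step transition probabilities
    P4 = np.linalg.matrix_power(P, 4)   # 4-step transition probabilities

    # Chapman-Kolmogorov: P^(2+2) = P^(2) P^(2).
    assert np.allclose(P2 @ P2, P4)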


Chapman-Kolmogorov equation for MPs (2)


Example of a Markov Chain (1)

 

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Example of a Markov Chain (2)

Possible samples for the given Markov chain example starting from s = 1 (small tree):

  • Small → gone

  • Small → medium → gone

  • Small → medium → large → gone

  • Small → medium → large → large → . . .

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
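A short sketch of how such sample paths could be generated; the transition probabilities below are placeholders for illustration, not the values from the figure.

    import numpy as np

    states = ["small", "medium", "large", "gone"]
    # Placeholder transition matrix (rows: current state, columns: next state).
    P = np.array([[0.5, 0.4, 0.0, 0.1],
                  [0.0, 0.5, 0.4, 0.1],
                  [0.0, 0.0, 0.9, 0.1],
                  [0.0, 0.0, 0.0, 1.0]])   # "gone" is absorbing

    rng = np.random.default_rng(1)

    def sample_path(start=0, max_steps=20):
        s = start
        path = [states[s]]
        while states[s] != "gone" and len(path) <= max_steps:
            s = rng.choice(len(states), p=P[s])
            path.append(states[s])
        return path

    print(sample_path())   # e.g., ['small', 'medium', 'gone']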


Example of a Markov Chain (3)

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Markov Reward Processes (MRPs) (1)

Definition
A Markov reward process is a tuple ⟨S, P, R, γ⟩, where
  • S is a (finite) set of states,
  • P is a state transition probability matrix with entries P_{ss'} = P[S_{t+1} = s' | S_t = s],
  • R is a reward function, R_s = E[R_{t+1} | S_t = s], and
  • γ ∈ [0, 1] is a discount rate.


Markov Reward Processes (MRPs) (2)


Example of an MRP

  • Growing larger trees is rewarded, since large trees are:

    • appreciated by hikers and

    • useful for wood production.

  • Losing a tree due to a hazard is not rewarded.

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

Fig: Forest MRP


Recap: Return

Definition
The return G_t is the total discounted reward from time step t onwards:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}.


Expressing return in a recursive way

Observe that:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
    = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...)
    = R_{t+1} + γ G_{t+1}.
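A short sketch contrasting the direct definition of the return with its recursive backward computation, for an arbitrary finite reward sequence (example values only):

    # Rewards R_1, R_2, ... collected in one episode (arbitrary example values).
    rewards = [-1.0, -1.0, 0.0, 5.0]
    gamma = 0.9

    # Direct definition: G_0 = sum_k gamma^k * R_{k+1}.
    G0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

    # Recursive form G_t = R_{t+1} + gamma * G_{t+1}, evaluated backwards from the end.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()          # now returns[t] == G_t

    assert abs(returns[0] - G0_direct) < 1e-12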


Example 1: return for an MRP

Fig.: An MRP with states Start, 1, 2, 3, 4, 5, and Exit; transition probabilities label the edges, and the state rewards shown are R = -10, R = -5, R = -5, R = -1, and R = 0.


Example 2: calculating return recursively for an MRP


 

Several reasons and motivations for using a discount rate can be listed as follows.

 


Value function for MRPs

Definition
The state-value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[G_t | S_t = s].

Fig: Isolines indicate state-value of different golf ball locations (source, Textbook)
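Since v(s) is an expectation over returns, it can be approximated by averaging sampled returns. A minimal Monte Carlo sketch, reusing the placeholder forest-like chain from above and assuming hypothetical state rewards:

    import numpy as np

    P = np.array([[0.5, 0.4, 0.0, 0.1],
                  [0.0, 0.5, 0.4, 0.1],
                  [0.0, 0.0, 0.9, 0.1],
                  [0.0, 0.0, 0.0, 1.0]])
    R = np.array([1.0, 2.0, 3.0, 0.0])      # hypothetical reward R_s for each state
    gamma = 0.9
    rng = np.random.default_rng(2)

    def sampled_return(s, max_steps=200):
        G, discount = 0.0, 1.0
        for _ in range(max_steps):
            G += discount * R[s]            # R_{t+1} is determined by the current state
            discount *= gamma
            s = rng.choice(len(P), p=P[s])
            if s == 3:                      # absorbing "gone" state (reward 0 afterwards)
                break
        return G

    v_small = np.mean([sampled_return(0) for _ in range(5000)])   # estimate of v(small)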


State-value samples of Forest MRP

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Bellman expectation equation for state-value function of MRPs (1)

First, we need a mathematical fact to be used.

Lemma (law of iterated expectations)
For random variables X and Y with E[|X|] < ∞,

E[ E[X | Y] ] = E[X].


Bellman expectation equation for state-value function of MRPs (2)

Bellman expectation equation for the state-value function of MRPs:

v(s) = E[ R_{t+1} + γ v(S_{t+1}) | S_t = s ],

i.e., the immediate reward R_{t+1} plus the discounted value of the successor state, γ v(S_{t+1}).


Bellman equation for state-value function of MRPs (1)

Expanding the expectation over successor states gives the Bellman equation for the state-value function of MRPs:

v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s').


Bellman equation for state-value function of MRPs (2)

We can express the Bellman equation for the state-value function concisely using matrices:

v = R + γ P v,

where v and R are column vectors with one entry per state and P is the state transition probability matrix.


 

Solving the MRP Bellman equation for state value function

The Bellman equation is linear and can be solved directly:

v = R + γ P v  ⇒  (I − γ P) v = R  ⇒  v = (I − γ P)^{-1} R.

The direct solution is practical only for small MRPs, since matrix inversion scales poorly with the number of states.
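A minimal sketch of the direct solution for the same placeholder MRP as above, solving the linear system instead of forming the inverse explicitly:

    import numpy as np

    P = np.array([[0.5, 0.4, 0.0, 0.1],
                  [0.0, 0.5, 0.4, 0.1],
                  [0.0, 0.0, 0.9, 0.1],
                  [0.0, 0.0, 0.0, 1.0]])
    R = np.array([1.0, 2.0, 3.0, 0.0])      # hypothetical reward vector
    gamma = 0.9

    # v = (I - gamma P)^{-1} R, computed by solving (I - gamma P) v = R.
    v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)

    # v satisfies the Bellman equation v = R + gamma P v.
    assert np.allclose(v, R + gamma * P @ v)

For γ < 1 the matrix I − γ P is always invertible, so the solution exists and is unique.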


Example: state values for Forest MRP

Fig: Forest Markov reward process including state values.

 

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Markov Decision Processes (MDPs): diagram


Markov Decision Processes (MDPs)

Definition (MDP = MRP extended with actions)
A Markov decision process is a tuple ⟨S, A, P, R, γ⟩, where
  • S is a (finite) set of states,
  • A is a (finite) set of actions,
  • P is a state transition probability function with entries P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a],
  • R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a], and
  • γ ∈ [0, 1] is a discount rate.
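One common way to store a small finite MDP in code is a nested mapping from state and action to a list of (probability, next state, reward) outcomes. The sketch below is a hypothetical forest-style example with made-up numbers, not the lecture's model:

    # mdp[s][a] = [(probability, next_state, reward), ...]
    mdp = {
        "small": {
            "wait": [(0.9, "large", 0.0), (0.1, "gone", 0.0)],
            "cut":  [(1.0, "gone", 1.0)],
        },
        "large": {
            "wait": [(0.9, "large", 2.0), (0.1, "gone", 0.0)],
            "cut":  [(1.0, "gone", 3.0)],
        },
        "gone": {
            "wait": [(1.0, "gone", 0.0)],
            "cut":  [(1.0, "gone", 0.0)],
        },
    }

    # Sanity check: outgoing probabilities sum to one for every (state, action) pair.
    for s, actions in mdp.items():
        for a, outcomes in actions.items():
            assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12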


Generic backup diagram for MDPs

 

 

 


Example 1: Mountain-Car (illustrating states, actions, rewards)

  • States: car position and car velocity

  • Actions: accelerate forward, accelerate backward, coast

  • Two reward formulations (see the code sketch below):

    a. reward = -1 for every time step until the car reaches the top

    b. reward = +1 at the top, 0 otherwise (and 0 after the top has been reached)
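The two reward formulations can be written as small functions; a sketch with a hypothetical goal-position threshold (the exact threshold depends on the environment):

    GOAL_POSITION = 0.5   # hypothetical position at which the car counts as "at the top"

    def reward_a(position):
        # Formulation (a): -1 for every time step until the car reaches the top.
        return 0.0 if position >= GOAL_POSITION else -1.0

    def reward_b(position, already_reached):
        # Formulation (b): +1 the first time the top is reached, 0 otherwise.
        return 1.0 if (position >= GOAL_POSITION and not already_reached) else 0.0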


Example 2: Forest MDP (1)

  • Two actions possible in each state:

    • Wait a = w: let the tree grow.

    • Cut a = c: gather the wood.

  • With increasing tree size the wood reward increases as well.

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Example 2: Forest MDP (2)

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

For the forest MDP example, the state transition probability matrix and reward function are given as:


Formulating problems as MDPs

  • It is important that the states are defined in such a way that the Markov property holds.

  • The rewards are quite “objective”: they are intended to capture the goal of the problem.

  • Often there are several ways to formulate a sequential decision problem as an MDP (with resulting differences in computational complexity).


Policy (1)

  • The goal of the agent is to find a way of controlling the system (a strategy or plan), called a policy, that maximizes the expected value of the return at every time step.

Definition
A policy π is a distribution over actions given states: π(a|s) = P[A_t = a | S_t = s].

We differentiate between two kinds of policies:
  • a deterministic policy a = π(s), which assigns a single action to each state, and
  • a stochastic policy π(a|s), which assigns a probability to each action in each state.

Remarks:
  • A policy fully defines the behavior of an agent.
  • MDP policies depend on the current state (not the history), i.e., policies are stationary (time-independent): A_t ∼ π(·|S_t) for all t > 0.
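A minimal sketch of the two kinds of policies for a hypothetical two-action, forest-style MDP (action names and probabilities are placeholders):

    import numpy as np

    rng = np.random.default_rng(3)
    actions = ["wait", "cut"]

    # Deterministic policy: a = pi(s), one fixed action per state.
    def pi_deterministic(state):
        return "cut" if state == "large" else "wait"

    # Stochastic policy: pi(a|s), a probability for each action in each state
    # (here a "fifty-fifty" policy in every state).
    def pi_stochastic(state):
        probs = [0.5, 0.5]
        return actions[rng.choice(len(actions), p=probs)]

Note that both policies look only at the current state, which is exactly the stationarity remark above.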


Policy (2)


Value functions for MDPs

Definition (state-value function)
The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π:

v_π(s) = E_π[G_t | S_t = s].

Definition (action-value function)
The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a].


Bellman expectation equation for state-value function of MDPs

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

v_π(s) = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ].
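One standard way to solve this equation numerically is to iterate it as a fixed-point update (iterative policy evaluation). A minimal sketch, assuming a hypothetical MDP stored as arrays P[a, s, s'] and R[s, a] and a fixed stochastic policy pi[s, a]:

    import numpy as np

    n_s, n_a, gamma = 3, 2, 0.9
    rng = np.random.default_rng(4)

    # Hypothetical MDP: random transition probabilities and rewards.
    P = rng.random((n_a, n_s, n_s))
    P /= P.sum(axis=2, keepdims=True)        # normalize rows into distributions
    R = rng.random((n_s, n_a))
    pi = np.full((n_s, n_a), 0.5)            # fifty-fifty policy

    v = np.zeros(n_s)
    for _ in range(1000):
        # q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) v(s')
        q = R + gamma * np.einsum("asn,n->sa", P, v)
        # v(s) = sum_a pi(a|s) q(s, a)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new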


Bellman expectation equation for action-value function of MDPs

The action-value function can similarly be decomposed as:

q_π(s, a) = E_π[ R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ].


Summary Bellman expectation equations and backup diagrams
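As a compact summary (standard forms, using the notation defined earlier), the Bellman expectation equations relate v_π and q_π as follows:

    \begin{aligned}
      v_\pi(s)   &= \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a) \\
      q_\pi(s,a) &= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s') \\
      v_\pi(s)   &= \sum_{a \in A} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s') \Big) \\
      q_\pi(s,a) &= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')
    \end{aligned}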


 

Bellman expectation equation for state-value function in matrix form for MDPs (1)

Under a fixed policy π, the MDP reduces to an MRP with

R^π_s = Σ_{a∈A} π(a|s) R^a_s   and   P^π_{ss'} = Σ_{a∈A} π(a|s) P^a_{ss'},

so the Bellman expectation equation can be written in matrix form as v_π = R^π + γ P^π v_π.


Bellman expectation equation for state-value function in matrix form for MDPs (2)

As for MRPs, this linear system can be solved directly: v_π = (I − γ P^π)^{-1} R^π.
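A minimal sketch of this reduction and the direct solve, reusing the hypothetical arrays P[a, s, s'], R[s, a], and pi[s, a] from the policy-evaluation sketch above:

    import numpy as np

    n_s, n_a, gamma = 3, 2, 0.9
    rng = np.random.default_rng(4)
    P = rng.random((n_a, n_s, n_s))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_s, n_a))
    pi = np.full((n_s, n_a), 0.5)

    # Policy-averaged reward vector and transition matrix:
    # R_pi[s] = sum_a pi(a|s) R(s, a),  P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
    R_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,asn->sn", pi, P)

    # Direct solution of v_pi = R_pi + gamma * P_pi v_pi.
    v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)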


State-value calculation for “Forest Tree” using Bellman expectation equation (1)

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


State-value calculation for “Forest Tree” using Bellman expectation equation (2)

 

Fig.: Forest MDP with fifty-fifty policy including state values

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Action-value calculation for “Forest Tree” using Bellman expectation equation

The action values can be directly calculated from the state values using the following Bellman expectation equation:

q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s').

Fig.: Forest MDP with fifty-fifty policy plus action values

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

Q: Does the “fifty-fifty” policy maximize the long-term reward? How could you evaluate your answer?
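A minimal sketch of this direct calculation, with the same hypothetical arrays as in the earlier sketches (v_pi computed via the matrix-form solution):

    import numpy as np

    n_s, n_a, gamma = 3, 2, 0.9
    rng = np.random.default_rng(4)
    P = rng.random((n_a, n_s, n_s))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_s, n_a))
    pi = np.full((n_s, n_a), 0.5)                       # fifty-fifty policy

    P_pi = np.einsum("sa,asn->sn", pi, P)
    R_pi = (pi * R).sum(axis=1)
    v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

    # q_pi(s, a) = R_s^a + gamma * sum_s' P^a_{ss'} v_pi(s')
    q_pi = R + gamma * np.einsum("asn,n->sa", P, v_pi)

    # Consistency check: averaging q_pi over the policy recovers v_pi.
    assert np.allclose((pi * q_pi).sum(axis=1), v_pi)

Comparing the action values of the two actions in each state is one way to judge whether the fifty-fifty policy can still be improved.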


References (utilized for preparation of lecture notes)
