Lecture 3 - Markov Decision Processes, Part 1: MDP Components & Bellman Equations
1
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 542-Reinforcement Learning
2
List of contents for this lecture
3
Relevant readings for this lecture
4
Fig: agent-environment interaction loop (observation, action)
We will develop a mathematical formulation of the agent-environment interaction through a modelling framework known as the Markov Decision Process (MDP).
Introduction
5
Recap: episodic versus continuing tasks
In an episodic task, the agent-environment interaction breaks naturally into episodes, each ending in a terminal state, and each episode starts independently of the other. In a continuing task, the interaction goes on without a natural endpoint.
6
Markov processes (MPs) → Markov reward processes (MRPs) → Markov decision processes (MDPs): ordered from simple to complex
Three processes: from simple to complex (1)
7
Markov processes (MPs) → Markov reward processes (MRPs) → Markov decision processes (MDPs): ordered from simple to complex
All three of the above processes share a very important property, known as the Markov property.
Definition (Markov property): a state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
(i.e., the current state is a sufficient statistic of the history)
Three processes: from simple to complex (2)
8
A remark on scalar and vectorial representations in finite MDPs
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
9
Definition: a Markov process (or Markov chain) is a tuple ⟨S, P⟩, where S is a finite set of states and P is the state transition probability matrix with entries P_{ss'} = P[S_{t+1} = s' | S_t = s].
Markov Processes (MPs)
10
Definition (C-K equation for finite MPs): for a finite MP with one-step transition matrix P, the n-step transition probabilities satisfy p^{(m+n)}_{ss'} = Σ_{s''∈S} p^{(m)}_{ss''} p^{(n)}_{s''s'}, i.e., P^{(m+n)} = P^{(m)} P^{(n)}, so that P^{(n)} = P^n.
Chapman-Kolmogorov equation for MPs (1)
Example:
11
Chapman-Kolmogorov equation for MPs (2)
12
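As a quick numerical check of the C-K equation, the sketch below compares the two-step transition probability obtained by summing over intermediate states with the corresponding entry of P squared, using a hypothetical 3-state transition matrix (not one of the lecture examples):

import numpy as np

# Hypothetical one-step transition matrix of a 3-state Markov process (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

s, s_prime = 0, 2
# Chapman-Kolmogorov: p^(2)_{s s'} = sum over s'' of p_{s s''} * p_{s'' s'}
p2_sum = sum(P[s, k] * P[k, s_prime] for k in range(P.shape[0]))
# Equivalently, the (s, s') entry of the matrix power P^2.
p2_matrix = np.linalg.matrix_power(P, 2)[s, s_prime]
print(p2_sum, p2_matrix)  # both give the same two-step probability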
Example of a Markov Chain (1)
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
13
Example of a Markov Chain (2)
Possible samples for the given Markov chain example starting from s = 1 (small tree):
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
14
Example of a Markov Chain (3)
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
15
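Sample trajectories like the "small tree" above can also be generated programmatically; a minimal sketch, assuming a hypothetical 4-state chain with an absorbing terminal state (the actual forest transition probabilities are those shown in the figure and are not repeated here):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition matrix; state 3 is absorbing (terminal).
P = np.array([[0.0, 0.9, 0.0, 0.1],
              [0.0, 0.0, 0.9, 0.1],
              [0.0, 0.0, 0.9, 0.1],
              [0.0, 0.0, 0.0, 1.0]])
terminal = 3

def sample_episode(start=0, max_steps=20):
    """Sample one state sequence S_0, S_1, ... until the terminal state is reached."""
    s, traj = start, [start]
    for _ in range(max_steps):
        if s == terminal:
            break
        s = rng.choice(len(P), p=P[s])
        traj.append(s)
    return traj

for _ in range(3):
    print(sample_episode())  # e.g. [0, 1, 2, 2, 3]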
Definition: a Markov reward process (MRP) is a tuple ⟨S, P, R, γ⟩, where S is a finite set of states, P is the state transition probability matrix, R is the reward function with R_s = E[R_{t+1} | S_t = s], and γ ∈ [0, 1] is the discount factor.
Markov Reward Processes (MRPs) (1)
16
Markov Reward Processes (MRPs) (2)
17
Example of an MRP
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Fig: Forest MRP
18
Definition: the return G_t is the total discounted reward from time step t onwards: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯ = Σ_{k=0}^∞ γ^k R_{t+k+1}.
Recap: Return
19
Observe that: G_t = R_{t+1} + γ(R_{t+2} + γR_{t+3} + ⋯) = R_{t+1} + γG_{t+1}, i.e., the return can be expressed recursively.
Expressing return in a recursive way
20
Example 1: return for an MRP
Fig: example MRP with states 1-5, entered at Start (state 1) and leaving through Exit (after state 5); the state rewards are R = -10, -5, -5, -1, 0 and the transition probabilities are shown on the edges of the graph.
21
Example 2: calculating return recursively for an MRP
22
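As a quick illustration of the recursion G_t = R_{t+1} + γ G_{t+1}, the sketch below uses the reward values listed in Example 1 as a sample reward sequence and checks that the backward recursion reproduces the directly computed discounted sums:

# Rewards R_1, ..., R_T collected along one episode (values taken from Example 1).
rewards = [-10, -5, -5, -1, 0]
gamma = 0.9

# Direct definition: G_t = sum_k gamma^k * R_{t+k+1}
G_direct = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
            for t in range(len(rewards))]

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, computed backwards from the end.
G_rec, G_next = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    G_next = rewards[t] + gamma * G_next
    G_rec[t] = G_next

print(G_direct)
print(G_rec)  # identical values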
Several reasons and motivations for using a discount rate γ < 1 can be listed as follows:
- Mathematical convenience: it keeps the return finite and the Bellman equations well behaved in cyclic or infinite-horizon processes.
- Uncertainty about the future: the model of what happens far ahead is usually less reliable than the model of the immediate future.
- Immediate rewards are often genuinely preferable: rewards received now can earn interest in financial settings, and animal/human behaviour also shows a preference for immediate reward.
- If every episode is guaranteed to terminate, an undiscounted return (γ = 1) may also be used.
23
Definition: the state-value function v(s) of an MRP is the expected return starting from state s: v(s) = E[G_t | S_t = s].
Value function for MRPs
Fig: Isolines indicate state-value of different golf ball locations (source, Textbook)
24
State-value samples of Forest MRP
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
25
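The state-value samples above are simply averages of sampled returns; the sketch below estimates v(s) by Monte-Carlo averaging on a small hypothetical MRP (placeholder probabilities and rewards, not the forest numbers):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MRP: state 2 is terminal; R[s] is the expected reward received on leaving s.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma, terminal = 0.9, 2

def sampled_return(s, max_steps=200):
    """Return G_0 of one sampled episode started in state s."""
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        if s == terminal:
            break
        G += discount * R[s]
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return G

# Monte-Carlo estimate of v(s): average of many sampled returns per non-terminal state.
v_estimate = [np.mean([sampled_return(s) for _ in range(2000)]) for s in range(2)]
print(v_estimate)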
First, we need a mathematical fact to be used.
Lemma (law of iterated expectations): for random variables X and Y, E[E[X | Y]] = E[X]; conditioning on extra information and then averaging it out leaves the expectation unchanged.
Bellman expectation equation for state-value function of MRPs (1)
26
Bellman expectation equation for state-value function of MRPs: v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s], where R_{t+1} is the immediate reward and γ v(S_{t+1}) is the discounted value of the successor state.
Bellman expectation equation for state-value function of MRPs (2)
27
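The derivation behind this slide is the standard one: combine the recursive return G_t = R_{t+1} + γ G_{t+1} with the law of iterated expectations. In LaTeX form:

\begin{aligned}
v(s) &= \mathbb{E}\left[G_t \mid S_t = s\right]
      = \mathbb{E}\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
     &= \mathbb{E}\left[R_{t+1} + \gamma\, \mathbb{E}\left[G_{t+1} \mid S_{t+1}\right] \mid S_t = s\right]
      = \mathbb{E}\left[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s\right].
\end{aligned}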
Bellman equation for state-value function of MRPs: taking the expectation over the successor states explicitly gives v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s').
Bellman equation for state-value function of MRPs (1)
28
We can express the Bellman equation for the state-value function concisely using matrices: v = R + γPv, where v and R are column vectors with one entry per state and P is the state transition probability matrix.
Bellman equation for state-value function of MRPs (2)
29
Since v = R + γPv is a linear equation, it can be solved directly: v = (I - γP)^{-1} R. The computational complexity is O(n³) for n states, so the direct solution is practical only for small MRPs; for large problems, iterative methods are used instead.
Solving the MRP Bellman equation for state value function
30
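A minimal numpy sketch of the direct solution v = (I - γP)^{-1}R, again with a hypothetical transition matrix and reward vector rather than the forest numbers; np.linalg.solve is used instead of forming the inverse explicitly, but it is the same O(n³) direct solution:

import numpy as np

# Hypothetical 3-state MRP.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0, -1.0])   # expected immediate reward per state
gamma = 0.9

# Direct solution of v = R + gamma * P v  <=>  (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)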
Example: state values for Forest MRP
Fig: Forest Markov reward process including state values.
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
31
Markov Decision Processes (MDPs): diagram
32
Note:
Markov Decision Processes (MDPs)
Definition (MDP = MRP extended with actions): a Markov decision process is a tuple ⟨S, A, P, R, γ⟩, where S is a finite set of states, A is a finite set of actions, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a] is the state transition probability, R^a_s = E[R_{t+1} | S_t = s, A_t = a] is the reward function, and γ ∈ [0, 1] is the discount factor.
33
Generic backup diagram for MDPs
34
a. reward = -1 for every time step until the car reaches the top
b. reward = +1 at the top, 0 otherwise (and 0 after the top is reached)
Example 1: Mountain-Car (illustrating states, actions, rewards)
35
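As a side note, the Mountain-Car task is available as MountainCar-v0 in the Gymnasium package, which uses reward formulation (a) above (-1 per time step). A minimal interaction sketch, assuming gymnasium is installed:

import gymnasium as gym

env = gym.make("MountainCar-v0")        # state: (position, velocity); actions: 0=left, 1=idle, 2=right
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # random policy, just to illustrate the interaction loop
    obs, reward, terminated, truncated, info = env.step(action)  # reward is -1 per step
    total_reward += reward
    if terminated or truncated:
        break

print(total_reward)
env.close()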
Example 2: Forest MDP (1)
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
36
Example 2: Forest MDP (2)
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
For the forest MDP example, the state transition probability matrix and reward function are given as follows (recall the earlier remark on scalar and vectorial representations and the resulting different computational complexities).
37
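In a vectorial (array) representation, a finite MDP such as the forest example can be stored as an |A| x |S| x |S| transition array and an |S| x |A| reward matrix. The sketch below uses placeholder numbers and action labels, not the actual forest MDP entries given on the slide:

import numpy as np

n_states, n_actions = 3, 2   # placeholder sizes; actions labelled 0 = "wait", 1 = "cut" for illustration

# P[a, s, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a); each row P[a, s, :] sums to 1.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0],
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
P[1] = [[0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 0.0]])

assert np.allclose(P.sum(axis=2), 1.0)   # sanity check: every (a, s) row is a valid distribution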
Formulating problems as MDPs
38
Definition: a policy π is a distribution over actions given states: π(a | s) = P[A_t = a | S_t = s].
We differentiate between two kinds of policies: deterministic policies, which assign exactly one action to each state, a = π(s), and stochastic policies, which assign a probability to every action in each state, π(a | s).
Remarks: a policy fully defines the behaviour of an agent; for MDPs, the policy depends only on the current state rather than on the whole history, i.e., policies are stationary.
Policy (1)
39
Policy (2)
40
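Programmatically, the two kinds of policies are often represented as a per-state action lookup (deterministic) and an |S| x |A| probability matrix (stochastic); a small illustrative sketch with placeholder sizes:

import numpy as np

n_states, n_actions = 3, 2

# Deterministic policy: one action per state, a = pi(s).
pi_det = np.array([0, 1, 0])

# Stochastic policy: pi[s, a] = Pr(A_t = a | S_t = s); here the "fifty-fifty" policy used later.
pi_stoch = np.full((n_states, n_actions), 0.5)

def act(policy, s, rng=np.random.default_rng()):
    """Draw an action in state s; handles both policy representations."""
    if policy.ndim == 1:                                  # deterministic lookup
        return int(policy[s])
    return int(rng.choice(n_actions, p=policy[s]))        # sample from pi(. | s)

print(act(pi_det, 1), act(pi_stoch, 1))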
Definition (state-value function): the state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π: v_π(s) = E_π[G_t | S_t = s].
Definition (action-value function): the action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π: q_π(s, a) = E_π[G_t | S_t = s, A_t = a].
Value functions for MDPs
41
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state: v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s].
Bellman expectation equation for state-value function of MDPs
42
The action-value function can similarly be decomposed as: q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a].
Bellman expectation equation for action-value function of MDPs
43
Summary Bellman expectation equations and backup diagrams
53
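The equations summarized on this slide are the standard Bellman expectation identities relating v_π and q_π (notation as in the MDP definition above); restated in LaTeX:

\begin{aligned}
v_\pi(s)   &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a), \\
q_\pi(s,a) &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s'), \\
v_\pi(s)   &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \Big), \\
q_\pi(s,a) &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s',a').
\end{aligned}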
Bellman expectation equation for state-value function in matrix form for MDPs (1)
54
Bellman expectation equation for state-value function in matrix form for MDPs (2)
55
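The standard construction behind these two slides: average over the policy to obtain R^π_s = Σ_a π(a|s) R^a_s and P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}, so that v_π = R^π + γ P^π v_π, which can be solved directly as v_π = (I - γP^π)^{-1} R^π. A minimal numpy sketch of this exact policy evaluation, using placeholder arrays (not the forest numbers):

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} R_pi.

    P  : (n_actions, n_states, n_states) with P[a, s, s'] = Pr(s' | s, a)
    R  : (n_states, n_actions) with R[s, a] = E[R_{t+1} | s, a]
    pi : (n_states, n_actions) with pi[s, a] = Pr(a | s)
    """
    P_pi = np.einsum('sa,asn->sn', pi, P)   # policy-averaged transition matrix
    R_pi = (pi * R).sum(axis=1)             # policy-averaged reward vector
    n = R_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Example with the fifty-fifty policy on the placeholder P, R from the earlier sketch:
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 0.0]])
pi = np.full((3, 2), 0.5)
print(evaluate_policy(P, R, pi))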
State-value calculation for “Forest Tree” using Bellman expectation equation (1)
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
56
State-value calculation for “Forest Tree” using Bellman expectation equation (2)
Fig: Forest MDP with fifty-fifty policy including state values
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
57
Action-value calculation for “Forest Tree” using Bellman expectation equation
The action values can be directly calculated from the state values using the following Bellman expectation equation: q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s').
Fig: Forest MDP with fifty-fifty policy plus action values
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Q: Is the fifty-fifty policy maximizing the long-term reward? How can you evaluate your answer?
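One way to start answering the question above: evaluate the fifty-fifty policy, compute q_π(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') for each action, and check whether the policy puts weight on the action with the larger action value. The sketch below does this on the same placeholder arrays as before (not the actual forest values):

import numpy as np

# Placeholder MDP (same arrays as in the earlier representation sketch) and fifty-fifty policy.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 0.0]])
pi = np.full((3, 2), 0.5)
gamma = 0.9

# Policy evaluation: v_pi = (I - gamma * P_pi)^{-1} R_pi
P_pi = np.einsum('sa,asn->sn', pi, P)
R_pi = (pi * R).sum(axis=1)
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# q_pi[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v_pi[s']
q_pi = R + gamma * np.einsum('asn,n->sa', P, v_pi)
print(q_pi)

# A state where the two q-values differ shows the fifty-fifty policy is not greedy there.
print(q_pi.argmax(axis=1))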
References (utilized for preparation of lecture notes)
58