Lecture 6 - Dynamic Programming: Part 2 - Policy Iteration, Value Iteration & More on DP
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 542-Reinforcement Learning
List of contents for this lecture
Relevant readings for this lecture
Policy iteration (1)
Policy iteration combines policy evaluation and policy improvement in an alternating, iterative sequence:
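In the notation of Sutton & Barto, this alternation can be written as (E = policy evaluation, I = policy improvement):

$$\pi_0 \xrightarrow{\;E\;} v_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} v_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} v_*$$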
So, policy iteration is a sequence of monotonically improving policies and value functions.
Policy iteration (2)
Policy iteration example: forest tree MDP
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Policy iteration example: forest tree MDP with “tree hater” initial policy (1)
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Policy iteration example: forest tree MDP with “tree hater” initial policy (2)
Policy iteration example: forest tree MDP with “tree hater” initial policy (3)
Policy iteration example: forest tree MDP with “tree hater” initial policy (4)
Policy iteration example: forest tree MDP with “tree hater” initial policy (5)
Policy iteration example: forest tree MDP with “tree lover” initial policy (1)
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Policy iteration example: forest tree MDP with “tree lover” initial policy (2)
Policy iteration example: forest tree MDP with “tree lover” initial policy (3)
Policy iteration example: forest tree MDP with “tree lover” initial policy (4)
Policy iteration example: forest tree MDP with “tree lover” initial policy (5)
Policy iteration algorithm
Algorithm: policy iteration
Note that each policy evaluation, itself an iterative computation,
is started with the state-value function of the previous policy.
This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
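A minimal Matlab sketch of policy iteration under hypothetical input conventions (P is an nS x nS x nA array of transition probabilities p(s'|s,a), R is an nS x nA matrix of expected rewards, gamma the discount factor, theta the evaluation threshold); this is an illustrative sketch, not code distributed with the lecture:

% Minimal policy-iteration sketch for a finite MDP (hypothetical inputs).
function [V, pol] = policy_iteration(P, R, gamma, theta)
    [nS, ~, nA] = size(P);
    pol = ones(nS, 1);          % arbitrary initial deterministic policy
    V   = zeros(nS, 1);         % initial state-value guess
    policy_stable = false;
    while ~policy_stable
        % Policy evaluation: iterate the Bellman expectation update,
        % warm-started with the value function of the previous policy.
        delta = inf;
        while delta > theta
            delta = 0;
            for s = 1:nS
                v_old = V(s);
                a = pol(s);
                V(s) = R(s, a) + gamma * P(s, :, a) * V;
                delta = max(delta, abs(v_old - V(s)));
            end
        end
        % Policy improvement: act greedily with respect to the current V.
        policy_stable = true;
        for s = 1:nS
            Q = zeros(nA, 1);
            for a = 1:nA
                Q(a) = R(s, a) + gamma * P(s, :, a) * V;
            end
            [~, a_greedy] = max(Q);
            if a_greedy ~= pol(s)
                policy_stable = false;
            end
            pol(s) = a_greedy;
        end
    end
end

Note that the inner evaluation loop reuses the previous V, which is exactly the warm start described in the note above.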
Example: Jack’s car rental problem (1)
Rental requests:
- first location: modelled as a Poisson random variable with a mean of 3 cars.
- second location: modelled as a Poisson random variable with a mean of 4 cars.
Returns:
- first location: modelled as a Poisson random variable with a mean of 3 cars.
- second location: modelled as a Poisson random variable with a mean of 2 cars.
Each rented car is credited $10 by the national rental company. If Jack is out of cars at a location, then the business is lost.
Cars can be moved between the two locations overnight at a cost of $2 per car moved.
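For reference, the Poisson model above assigns probability

$$\Pr\{N = n\} = \frac{\lambda^{n}}{n!}\, e^{-\lambda}$$

to n requested (or returned) cars, with $\lambda = 3, 4$ for the requests and $\lambda = 3, 2$ for the returns at the first and second locations, respectively.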
Example: Jack’s car rental problem (2)
Fig.: Sequence of policies found by policy iteration including optimal state value after termination.
Value iteration (1)
Value iteration (2)
Value iteration relationship to Bellman optimality equation
Remember the Bellman optimality equation for the state value, and compare it with the value iteration update (both are written out below).
Remark: value iteration is obtained simply by turning the Bellman optimality equation into an update rule.
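In the notation of Sutton & Barto, the two expressions are

$$v_*(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$v_{k+1}(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr],$$

so the only change is replacing the fixed point $v_*$ by the iterates $v_k, v_{k+1}$.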
Why is value iteration a “policy iteration with a truncated policy evaluation”?
Value iteration is equivalent to policy iteration with a truncated policy evaluation: each policy evaluation is stopped after a single sweep (meaning one policy evaluation update per state) and is immediately followed by a greedy policy improvement.
Pseudocode for value iteration
Note: compared to policy iteration, value iteration doesn’t require an initial policy but only a state-value guess.
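For comparison, a minimal Matlab sketch of value iteration under the same hypothetical input conventions as the policy iteration sketch above (a generic sketch, not the lecture's val_iter.m):

% Minimal value-iteration sketch for a finite MDP (hypothetical inputs).
function [V, pol] = value_iteration(P, R, gamma, theta)
    [nS, ~, nA] = size(P);
    V = zeros(nS, 1);                   % only a state-value guess is needed
    delta = inf;
    while delta > theta
        delta = 0;
        for s = 1:nS
            v_old = V(s);
            Q = zeros(nA, 1);
            for a = 1:nA
                Q(a) = R(s, a) + gamma * P(s, :, a) * V;
            end
            V(s) = max(Q);              % Bellman optimality update
            delta = max(delta, abs(v_old - V(s)));
        end
    end
    % Extract a deterministic greedy policy once the values have converged.
    pol = zeros(nS, 1);
    for s = 1:nS
        Q = zeros(nA, 1);
        for a = 1:nA
            Q(a) = R(s, a) + gamma * P(s, :, a) * V;
        end
        [~, pol(s)] = max(Q);
    end
end

The loop starts from a state-value guess only; the greedy policy is extracted once at the end.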
Value iteration for forest tree MDP
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Tab.: Value iteration for forest tree MDP
Note: the value iteration code (in Matlab) for this example is given in “val_iter.m”.
Summary of dynamic programming algorithms
Problem | Bellman equation | Algorithm |
Prediction | Bellman expectation equation | Iterative policy evaluation |
Control | Bellman expectation equation + Greedy policy improvement | Policy iteration |
Control | Bellman optimality equation | Value iteration |
Complexities: for m actions and n states, the algorithms above (based on the state-value function) cost O(m n²) per iteration; the corresponding algorithms based on the action-value function cost O(m² n²) per iteration.
Asynchronous dynamic programming (1)
Asynchronous dynamic programming (2)
Asynchronous DP algorithms update only a subset of the states rather than the entire set of states at every policy evaluation iteration. States are updated in any order, using whatever values of the other states are available; the values of some states may therefore be updated several times before the values of others are updated once (a minimal sketch is given below).
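A minimal Matlab sketch of one asynchronous (in-place) sweep under the same hypothetical input conventions, where only a chosen subset of states is updated and the latest available values of all other states are used:

% One asynchronous (in-place) value-iteration sweep over 'subset'
% (a row vector of state indices); the other states keep their current V.
function V = async_sweep(P, R, gamma, V, subset)
    [~, ~, nA] = size(P);
    for s = subset                      % only this subset of states is swept
        Q = zeros(nA, 1);
        for a = 1:nA
            Q(a) = R(s, a) + gamma * P(s, :, a) * V;   % uses the latest V in place
        end
        V(s) = max(Q);
    end
end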
Asynchronous DP: prioritized sweeping
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Remember Bellman optimality equation for state value:
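In Sutton & Barto's notation:

$$v_*(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

In the standard formulation of prioritized sweeping, states are ordered for updating by the magnitude of their Bellman error with respect to this equation,

$$\Bigl|\,\max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v(s')\bigr] \;-\; v(s)\Bigr|,$$

and after each update the priorities of the updated state's predecessors are recomputed accordingly.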
Asynchronous DP: real-time updates
(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Fig.: Real-time DP updates focus on the states reachable from the start states. Relevant states: reachable from some start state under some optimal policy. Irrelevant states: unreachable from any start state under any optimal policy.
Generalized policy iteration (GPI) (1)
(i) Making the policy greedy with respect to the value function makes the value function incorrect for the changed policy.
(ii) Making the value function consistent with the policy causes that policy to no longer be greedy.
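This tension resolves only when both processes stabilize, i.e. when the value function is consistent with the policy and the policy is greedy with respect to that value function; together these imply the Bellman optimality equation (in Sutton & Barto's notation):

$$v_{\pi}(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_{\pi}(s')\bigr] \quad \Rightarrow \quad v_{\pi} = v_{*},\ \ \pi \text{ is optimal.}$$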
Generalized policy iteration (GPI) (2)
In classical policy iteration, each policy evaluation (PE) step is followed by a complete policy improvement (PI) step (i.e. PE, PI, PE, PI, PE, PI, PE, ...). In GPI the two processes may be interleaved at any granularity; as long as both continue to update all states, the value function still converges to the optimal value function and the policy still converges to the optimal policy.
Generalized policy iteration (GPI) (3)
The evaluation and improvement processes jointly drive the value function and the policy toward optimality, even though neither attempts to achieve it directly.
Fig.: Interpretation of generalized policy iteration as switching back and forth between (arbitrary) evaluation and improvement steps: starting from an arbitrary value function and policy, the process moves alternately toward the policy evaluation line and the policy improvement line until it reaches their intersection, where the policy and value function are optimal.
Curse of dimensionality in dynamic programming (1)
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Curse of dimensionality in dynamic programming (2)
(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Dynamic programming in practice
Although DP may not seem practical for very large problems, compared with other methods (e.g., direct policy search or linear programming) for solving MDPs, DP methods are actually quite efficient. Both policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general. For problems with large state spaces, asynchronous DP methods are often preferred.
References (utilized for preparation of lecture notes or Matlab code)