
Lecture 6: Dynamic Programming, Part 2 - Policy Iteration, Value Iteration & More on DP


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 542-Reinforcement Learning


List of contents for this lecture

  • Policy iteration

  • Value iteration

  • Asynchronous dynamic programming

  • Generalized policy iteration

  • Efficiency of dynamic programming


Relevant readings for this lecture

  • Chapter 4 of “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, Cambridge, MA, 2018


Policy iteration (1)


Policy iteration combines policy evaluation and policy improvement in an iterative sequence:

$$\pi_0 \;\xrightarrow{\;E\;}\; v_{\pi_0} \;\xrightarrow{\;I\;}\; \pi_1 \;\xrightarrow{\;E\;}\; v_{\pi_1} \;\xrightarrow{\;I\;}\; \pi_2 \;\xrightarrow{\;E\;}\; \cdots \;\xrightarrow{\;I\;}\; \pi_* \;\xrightarrow{\;E\;}\; v_* ,$$

where $\xrightarrow{E}$ denotes a policy evaluation step and $\xrightarrow{I}$ denotes a policy improvement step.

So, policy iteration is a sequence of monotonically improving policies and value functions.


Policy iteration (2)


  • Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal).

  • Because a finite MDP has only a finite number of deterministic policies (why?), this process must converge to an optimal policy and to the optimal value function in a finite number of iterations.
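One way to answer the "why?" above: a deterministic stationary policy assigns one action to each state, so

$$\#\{\text{deterministic stationary policies}\} \;=\; \prod_{s \in \mathcal{S}} |\mathcal{A}(s)| \;\le\; |\mathcal{A}|^{|\mathcal{S}|} \;<\; \infty .$$

Since each iteration produces a strictly better policy until an optimal one is reached, no policy can repeat, and the loop must terminate after finitely many improvement steps.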


Policy iteration example: forest tree MDP


(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Policy iteration example: forest tree MDP with “tree hater” initial policy (1)


(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Policy iteration example: forest tree MDP with “tree hater” initial policy (2)



Policy iteration example: forest tree MDP with “tree hater” initial policy (3)



Policy iteration example: forest tree MDP with “tree hater” initial policy (4)



Policy iteration example: forest tree MDP with “tree hater” initial policy (5)



Policy iteration example: forest tree MDP with “tree lover” initial policy (1)


(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Policy iteration example: forest tree MDP with “tree lover” initial policy (2)



Policy iteration example: forest tree MDP with “tree lover” initial policy (3)



Policy iteration example: forest tree MDP with “tree lover” initial policy (4)



Policy iteration example: forest tree MDP with “tree lover” initial policy (5)



Policy iteration algorithm


Algorithm: policy iteration

1. Initialization: choose $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$.
2. Policy evaluation: repeat sweeps of $V(s) \leftarrow \sum_{s',\,r} p(s',r \mid s,\pi(s))\,[\,r + \gamma V(s')\,]$ over all states until the largest change in a sweep is smaller than a threshold $\theta > 0$.
3. Policy improvement: for every state, set $\pi(s) \leftarrow \arg\max_{a} \sum_{s',\,r} p(s',r \mid s,a)\,[\,r + \gamma V(s')\,]$. If the policy did not change for any state, stop and return $V \approx v_*$ and $\pi \approx \pi_*$; otherwise go to step 2.

Note that each policy evaluation, itself an iterative computation, is started with the state-value function of the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
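Below is a minimal MATLAB sketch of the algorithm above for a generic finite MDP. The model representation (a transition array P(s,a,s') and an expected-reward matrix R(s,a)), the function name, and the stopping threshold theta are illustrative assumptions for this sketch, not something prescribed by the lecture.

```matlab
function [V, policy] = policy_iteration(P, R, gamma, theta)
% Minimal policy iteration sketch for a generic finite MDP (illustrative).
% P(s,a,s') : transition probabilities, size nS x nA x nS
% R(s,a)    : expected immediate reward, size nS x nA
% gamma     : discount factor, theta : policy-evaluation stopping threshold
[nS, nA] = size(R);
V      = zeros(nS, 1);             % initial state-value guess
policy = ones(nS, 1);              % arbitrary initial policy: action 1 everywhere

while true
    % --- Policy evaluation (iterative, warm-started from the previous V) ---
    while true
        delta = 0;
        for s = 1:nS
            v_old = V(s);
            p     = squeeze(P(s, policy(s), :));        % successor distribution
            V(s)  = R(s, policy(s)) + gamma * (p' * V); % in-place expectation backup
            delta = max(delta, abs(v_old - V(s)));
        end
        if delta < theta, break; end
    end
    % --- Policy improvement (greedy with respect to the current V) ---
    policy_stable = true;
    for s = 1:nS
        q = zeros(nA, 1);
        for a = 1:nA
            q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
        end
        [~, best] = max(q);
        if best ~= policy(s), policy_stable = false; end
        policy(s) = best;
    end
    if policy_stable, break; end   % no change: policy (and V) are optimal
end
end
```

Warm-starting the inner evaluation loop from the previous V is exactly the speed-up mentioned in the note above.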


Example: Jack’s car rental problem (1)


  • Jack manages two locations for a nationwide car rental company.

  • Car requests:
    - first location: modelled as a Poisson random variable with a mean of 3 cars.
    - second location: modelled as a Poisson random variable with a mean of 4 cars.

  • Car returns:
    - first location: modelled as a Poisson random variable with a mean of 3 cars.
    - second location: modelled as a Poisson random variable with a mean of 2 cars.

  • There can be no more than 20 cars at each location.

  • If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at a location, then the business is lost.

  • Jack can move cars between the two locations overnight at a cost of $2 per car moved.

  • Cars become available for renting the day after they are returned.

  • A maximum of 5 cars can be moved from one location to the other in one night.

  • We take the discount rate to be γ = 0.9.

Fig.: Illustration of the first and second rental locations.
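To illustrate how the Poisson demand enters the model, the following MATLAB sketch computes a truncated Poisson distribution and the expected one-day rental revenue at a single location as a function of the cars available in the morning. The truncation of the request distribution at 20 and all variable names are assumptions made for this sketch, not part of the original problem statement.

```matlab
% Truncated Poisson pmf for daily rental requests at one location:
% P(N = n) for n = 0..n_max, with the tail mass lumped into n_max.
lambda   = 3;                        % mean daily requests (first location)
n_max    = 20;                       % at most 20 cars fit at a location
n        = (0:n_max)';
pmf      = exp(-lambda) .* lambda.^n ./ factorial(n);
pmf(end) = 1 - sum(pmf(1:end-1));    % lump the Poisson tail into the last bin

% Expected one-day rental revenue as a function of cars available in the
% morning: rentals are min(requests, cars available), each credited $10.
rental_credit = 10;
cars_avail    = 0:n_max;
exp_revenue   = zeros(size(cars_avail));
for c = cars_avail
    rented           = min(n, c);              % cannot rent more cars than on hand
    exp_revenue(c+1) = rental_credit * (pmf' * rented);
end
disp([cars_avail' exp_revenue'])               % expected $ per day vs. cars on hand
```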


Example: Jack’s car rental problem (2)


  • This problem is formulated as a continuing finite MDP.

  • The time steps are days; the state is the number of cars at each location at the end of the day.

  • Actions are the net numbers of cars moved between the two locations overnight.

Fig.: Sequence of policies found by policy iteration, together with the optimal state-value function after termination.


Value iteration (1)


 


Value iteration (2)


  • The policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.

  • One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.

Written as a single update that combines one sweep of policy evaluation with policy improvement, value iteration reads

$$v_{k+1}(s) \;=\; \max_{a}\; \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, v_k(s')\,\bigr], \qquad \text{for all } s \in \mathcal{S}.$$


Value iteration relationship to Bellman optimality equation


Remember the Bellman optimality equation for the state value:

$$v_*(s) \;=\; \max_{a}\; \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr],$$

and compare it with the value iteration update:

$$v_{k+1}(s) \;=\; \max_{a}\; \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, v_k(s')\,\bigr].$$

Remark: value iteration is obtained simply by turning the Bellman optimality equation into an update rule.


Why is value iteration a “policy iteration with a truncated policy evaluation”?


Each value iteration sweep applies a greedy policy improvement (the $\max_a$ in the update) followed by exactly one backup of every state under that greedy policy, i.e. one sweep of policy evaluation. Value iteration is therefore policy iteration in which each policy evaluation is truncated to a single sweep.


Pseudocode for value iteration


Note: compared to policy iteration, value iteration doesn't require an initial policy, only an initial guess of the state values.
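For comparison with the policy iteration sketch given earlier, here is a minimal MATLAB sketch of value iteration for the same generic model representation (P(s,a,s'), R(s,a)). The names and the threshold theta are again illustrative assumptions; this sketch is independent of the course's own val_iter.m mentioned later.

```matlab
function [V, policy] = value_iteration(P, R, gamma, theta)
% Minimal value iteration sketch for a generic finite MDP (illustrative).
% P(s,a,s') : transition probabilities, R(s,a) : expected immediate reward.
[nS, nA] = size(R);
V = zeros(nS, 1);                       % only a state-value guess is needed
while true
    delta = 0;
    for s = 1:nS
        q = zeros(nA, 1);
        for a = 1:nA
            q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
        end
        v_new = max(q);                 % Bellman optimality backup
        delta = max(delta, abs(v_new - V(s)));
        V(s)  = v_new;                  % in-place update
    end
    if delta < theta, break; end        % last sweep changed values only slightly
end
% Extract a (near-)optimal deterministic policy, greedy with respect to V
policy = zeros(nS, 1);
for s = 1:nS
    q = zeros(nA, 1);
    for a = 1:nA
        q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
    end
    [~, policy(s)] = max(q);
end
end
```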


Value iteration for forest tree MDP


(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

 

Tab.: Value iteration for forest tree MDP

Note: The value iteration code (in Matlab) for this example is given in "val_iter.m".


Summary of dynamic programming algorithms


Problem     | Bellman equation                                          | Algorithm
------------|-----------------------------------------------------------|-----------------------------
Prediction  | Bellman expectation equation                              | Iterative policy evaluation
Control     | Bellman expectation equation + greedy policy improvement  | Policy iteration
Control     | Bellman optimality equation                               | Value iteration

Complexities (per iteration, for $m$ actions and $n$ states): algorithms based on the state-value function cost $O(m\,n^2)$; algorithms based on the action-value function cost $O(m^2 n^2)$.


Asynchronous dynamic programming (1)


 


Asynchronous dynamic programming (2)


  • Asynchronous DP methods refer to the idea of updating the value estimates of only a subset of states, rather than the entire state set, at every policy evaluation iteration.

  • Asynchronous DP methods simply update the values of states in any order, using whatever values of other states are available.

  • In asynchronous DP, the values of some states may be updated several times before the values of others are updated once.

  • In asynchronous DP, one may choose a smart update order to achieve a faster overall convergence rate.

  • Asynchronous DP methods are guaranteed to converge if all states continue to be selected.

  • Asynchronous DP methods can significantly reduce the computation time (see the sketch below).
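A minimal MATLAB sketch of the "any order" idea: a single state is picked (here uniformly at random) and backed up in place, using whatever values the other states currently have. The small random MDP, the array names, and the fixed update budget are assumptions made for this sketch.

```matlab
% Small random MDP for demonstration (illustrative, not the lecture's example)
nS = 6; nA = 2; gamma = 0.9;
P  = rand(nS, nA, nS); P = P ./ sum(P, 3);   % normalize over successor states
R  = rand(nS, nA);

V = zeros(nS, 1);
for k = 1:50000                              % fixed budget of single-state backups
    s = randi(nS);                           % pick a state in arbitrary (random) order
    q = zeros(nA, 1);
    for a = 1:nA
        q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
    end
    V(s) = max(q);                           % in-place Bellman optimality backup
end
% As long as every state keeps being selected, V converges to v*.
```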


Asynchronous DP: prioritized sweeping


(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

Remember the Bellman optimality equation for the state value:

$$v_*(s) \;=\; \max_{a}\; \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr].$$

  • One can use the magnitude of the Bellman error (BE) as an indicator of which state should be updated next:

$$\mathrm{BE}(s) \;=\; \Bigl|\, \max_{a}\; \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, V(s')\,\bigr] \;-\; V(s) \,\Bigr| .$$

  • Update the state with the largest Bellman error first.

  • Build up a priority queue of the most relevant states by refreshing the Bellman error after each state update (a minimal sketch follows below).
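One simple way to realize this priority rule in MATLAB, assuming the same generic model arrays used in the earlier sketches; the naive full refresh of the error vector is for clarity only, since a practical implementation would refresh only the predecessors of the updated state.

```matlab
% Small random MDP for demonstration (illustrative, not the lecture's example)
nS = 6; nA = 2; gamma = 0.9;
P  = rand(nS, nA, nS); P = P ./ sum(P, 3);      % normalize over successor states
R  = rand(nS, nA);

% One Bellman optimality backup for state s given the current value vector V
bellman_backup = @(s, V) max(arrayfun(@(a) ...
    R(s, a) + gamma * (squeeze(P(s, a, :))' * V), 1:nA));

V  = zeros(nS, 1);
BE = zeros(nS, 1);                              % Bellman error per state
for s = 1:nS, BE(s) = abs(bellman_backup(s, V) - V(s)); end

theta = 1e-6;
while max(BE) > theta
    [~, s] = max(BE);                           % state with the highest priority
    V(s)   = bellman_backup(s, V);              % update it first
    % Naive refresh of all Bellman errors; a practical priority queue would
    % only refresh the predecessors of s.
    for sp = 1:nS
        BE(sp) = abs(bellman_backup(sp, V) - V(sp));
    end
end
```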


Asynchronous DP: real-time updates


(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

  • We update those states that are frequently visited by the agent.

  • Here, we utilize the agent's experience to guide the asynchronous DP updates.

  • After each time step, we update the relevant states, e.g. the state the agent has just visited.

Fig.: Real-time DP updates focus on the relevant states (those reachable from some start state under some optimal policy); irrelevant states (unreachable from any start state under any optimal policy) can be ignored.
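A minimal MATLAB sketch of the real-time idea under the same generic model assumptions: the agent follows greedy actions from an assumed start state and backs up only the states it actually visits, so irrelevant states are never touched. The episode count, episode length, and single start state are illustrative choices.

```matlab
% Small random MDP for demonstration (illustrative, not the lecture's example)
nS = 6; nA = 2; gamma = 0.9;
P  = rand(nS, nA, nS); P = P ./ sum(P, 3);       % normalize over successor states
R  = rand(nS, nA);

V = zeros(nS, 1);
start_state = 1;                                 % assumed single start state
for episode = 1:200
    s = start_state;
    for t = 1:100                                % fixed episode length (assumption)
        q = zeros(nA, 1);
        for a = 1:nA
            q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
        end
        [V(s), a_greedy] = max(q);               % back up only the visited state
        cdf = cumsum(squeeze(P(s, a_greedy, :)));% sample next state from the model
        cdf(end) = 1;                            % guard against round-off
        s   = find(rand <= cdf, 1, 'first');
    end
end
% Only states reachable from the start state ever get updated.
```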


Generalized policy iteration (GPI) (1)


  • Policy evaluation: making the value function consistent with the current policy.

  • Policy improvement: making the policy greedy based on the current value function.

  • Generalized policy iteration (GPI) refers to the general idea of letting the policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

    • The evaluation and improvement processes can be viewed as both competing and cooperating.

    • They compete because they pull in opposing directions:

(i) making the policy greedy with respect to the value function makes the value function incorrect for the changed policy;

(ii) making the value function consistent with the policy typically causes that policy to no longer be greedy.

    • They cooperate because in the long run these two processes interact to find a single joint solution: the optimal value function and an optimal policy.


Generalized policy iteration (GPI) (2)


  • In standard policy iteration, you alternate between a policy evaluation (PE) step and a policy improvement (PI) step (i.e. PE, PI, PE, PI, PE, PI, ...).

  • However, in general, you don't have to follow this alternation strictly in order to converge (in the limit) to the optimal policy.

  • For example, value iteration is a truncated policy iteration that still converges to the optimal policy.

  • Therefore, the term GPI refers to all algorithms based on policy iteration, such as value iteration, that use PI and PE steps in some order and that are guaranteed to converge to the optimal policy, provided PE and PI are executed enough times (see the sketch below).
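The following MATLAB sketch illustrates this flexibility with a simple "modified policy iteration" scheme, which is an illustrative assumption rather than an algorithm defined on the slides: each greedy improvement is followed by k evaluation sweeps, so k = 1 behaves much like value iteration and a large k approaches standard policy iteration.

```matlab
function [V, policy] = generalized_pi(P, R, gamma, k, n_iters)
% GPI sketch: alternate greedy improvement with k policy-evaluation sweeps.
% k = 1 behaves like value iteration; large k approaches policy iteration.
[nS, nA] = size(R);
V      = zeros(nS, 1);
policy = ones(nS, 1);
for it = 1:n_iters
    % --- policy improvement: greedy with respect to the current V ---
    for s = 1:nS
        q = zeros(nA, 1);
        for a = 1:nA
            q(a) = R(s, a) + gamma * (squeeze(P(s, a, :))' * V);
        end
        [~, policy(s)] = max(q);
    end
    % --- truncated policy evaluation: k sweeps under the current policy ---
    for sweep = 1:k
        for s = 1:nS
            p    = squeeze(P(s, policy(s), :));
            V(s) = R(s, policy(s)) + gamma * (p' * V);
        end
    end
end
end
```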


Generalized policy iteration (GPI) (3)


  • The two processes together achieve the overall goal of optimality even though neither is attempting to achieve it directly.

  • In GPI, any policy evaluation algorithm and any policy improvement algorithm are possible.

Fig.: Interpretation of generalized policy iteration: starting from an initial value function and policy, the process switches back and forth between (arbitrary) evaluation steps (moving the value function toward the "policy evaluation line") and improvement steps (moving the policy toward the "policy improvement line").


Curse of dimensionality in dynamic programming (1)


 

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Curse of dimensionality in dynamic programming (2)


 

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)


Dynamic programming in practice


  • Even for large-scale problems, DP methods are actually quite efficient compared to other methods for solving MDPs (such as direct search or linear programming).

  • Both policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general.

  • On problems with large state spaces, asynchronous DP methods with variations of GPI are often preferred.

  • In practice, DP methods can be used with today’s computers to solve MDPs with millions of states.

  • In practice, DP methods usually converge much faster than their theoretical worst-case run times, particularly if they are started with good initial value functions or policies.


References (utilized for preparation of lecture notes or Matlab code)
