Lecture 9: Proximal Policy Optimization (PPO)
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 642 - Advanced Reinforcement Learning
List of contents for this lecture
Relevant readings/videos for this lecture
Laura Graesser and Wah Loon Keng, “Foundations of Deep Reinforcement Learning: Theory and Practice in Python”, Addison-Wesley Professional, 2019.
Issues in policy gradient methods
What is PPO (Proximal Policy Optimization)?
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, “Proximal Policy Optimization Algorithms”, https://arxiv.org/abs/1707.06347, 2017
PPO limits the size of each policy update to encourage stable, approximately monotonic policy improvement and thereby reduce the risk of performance collapse.
Compared to vanilla policy-gradient methods, PPO leads to more stable and more sample-efficient training.
Performance collapse
Trust region policy optimization (1)
(See Appendix for the proof)
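For readers following these notes without the board work, the constrained problem that TRPO solves can be stated as (notation as in Schulman et al.):

\[
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right] \\
\text{subject to}\quad & \mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\Vert\,\pi_{\theta}(\cdot \mid s)\right)\right] \le \delta ,
\end{aligned}
\]

i.e., maximize the surrogate advantage while keeping the average KL divergence from the data-collecting policy within a trust-region radius \(\delta\).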
Trust region policy optimization (2)
(The proof is left as an exercise.)
Trust region policy optimization (3)
Achiam, J., Held, D., Tamar, A., and Abbeel, P. “Constrained Policy Optimization.” 2017, https://arxiv.org/abs/1705.10528
Trust region policy optimization (4)
Trust region policy optimization (5)
Trust region policy optimization (6)
Trust region policy optimization (7)
Trust region policy optimization (8)
Proximal Policy Optimization (PPO) (1)
Proximal Policy Optimization (PPO) (2)
Proximal Policy Optimization (PPO) (3)
per-time-step expectation
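In the notation of Schulman et al. (2017), with the probability ratio \(r_t(\theta)\), the clipped surrogate objective is the per-time-step expectation

\[
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\]

where \(\hat{A}_t\) is an advantage estimate and the hyperparameter \(\epsilon\) (e.g. 0.2) sets the clip range.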
Proximal Policy Optimization (PPO) (4)
Proximal Policy Optimization (PPO) (5)
Proximal Policy Optimization (PPO) (6)
Proximal Policy Optimization (PPO) (7)
PPO is on-policy (1)
PPO is on-policy (2)
Sample efficiency of PPO
PPO can be used for both continuous and discrete action spaces
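In both cases the only policy-side quantity PPO needs is the log-probability log π_θ(a|s), from which the ratio is formed. A minimal NumPy sketch of the two cases (the logits, mean, and std below are illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: the policy outputs logits -> categorical distribution.
logits = np.array([0.5, 1.5, -0.3])
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax
a_disc = rng.choice(len(probs), p=probs)  # sample an action
logp_disc = np.log(probs[a_disc])         # log pi(a|s)

# Continuous action space: the policy outputs a Gaussian mean and std.
mu, sigma = 0.3, 0.5
a_cont = rng.normal(mu, sigma)            # sample an action
logp_cont = (-0.5 * ((a_cont - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))  # Gaussian log-density
```

The rest of the PPO machinery (ratio, clipping, advantage estimation) is identical in the two cases; only the distribution the policy parameterizes changes.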
PPO Algorithm
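The algorithm's overall structure (collect a batch under the current policy, then run several epochs of ascent on the clipped surrogate) can be sketched end-to-end on a toy problem. Everything below except the clipped-surrogate update rule is illustrative scaffolding: a 2-armed bandit stands in for the MDP, a softmax over two logits stands in for the policy network, and a finite-difference gradient stands in for backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an environment: arm 1 pays 1.0 on average, arm 0 pays 0.2.
def reward(action):
    return rng.normal(1.0 if action == 1 else 0.2, 0.1)

def policy_probs(theta):                  # softmax policy over 2 actions
    z = np.exp(theta - theta.max())
    return z / z.sum()

EPS = 0.2                                 # PPO clip parameter
theta = np.zeros(2)

for iteration in range(50):
    # --- collect a batch under the current (old) policy ---
    probs_old = policy_probs(theta)
    actions = rng.choice(2, size=64, p=probs_old)
    advantages = np.array([reward(a) for a in actions])
    advantages -= advantages.mean()       # crude baseline for illustration

    # --- several epochs of ascent on the clipped surrogate ---
    for _ in range(5):
        def surrogate(th):
            ratio = policy_probs(th)[actions] / probs_old[actions]
            return np.mean(np.minimum(ratio * advantages,
                                      np.clip(ratio, 1 - EPS, 1 + EPS) * advantages))
        # finite-difference gradient keeps the sketch dependency-free
        grad, h = np.zeros_like(theta), 1e-5
        for i in range(2):
            e = np.zeros_like(theta)
            e[i] = h
            grad[i] = (surrogate(theta + e) - surrogate(theta - e)) / (2 * h)
        theta += 0.5 * grad               # gradient ascent step

print(policy_probs(theta))                # mass should concentrate on arm 1
```

Note the two nested loops: the outer loop refreshes the data (PPO is on-policy), while the inner loop reuses the same batch for multiple gradient steps, which is where PPO's sample-efficiency gain over a single-step policy gradient comes from.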
+s, -s (1)
+s, -s (2)
-s:
Clipping does not strictly enforce a trust region: after an update, the probability ratio can still end up slightly outside the bounds [1-ε, 1+ε].
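This can be seen directly: once the ratio has overshot the bound, the clipped term dominates the min, the surrogate's gradient with respect to the ratio is zero, and nothing actively pulls the ratio back inside. A small numerical check (the ε and advantage values are illustrative):

```python
import numpy as np

EPS = 0.2
A = 1.0  # a positive advantage

def L_clip(r):
    # per-sample clipped surrogate as a function of the ratio r
    return min(r * A, np.clip(r, 1 - EPS, 1 + EPS) * A)

def grad(r, h=1e-6):
    # finite-difference derivative of the surrogate w.r.t. the ratio
    return (L_clip(r + h) - L_clip(r - h)) / (2 * h)

print(grad(1.1))  # inside the clip range: gradient equals A
print(grad(1.3))  # outside the range: gradient is zero, so r is not pushed back
```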
Appendix
References (utilized for preparation of lecture notes or MATLAB code)