Lecture 9: Proximal Policy Optimization (PPO)

Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 642 - Advanced Reinforcement Learning


List of contents for this lecture

  • Motivations and ideas behind PPO

  • Math behind PPO

  • The PPO Algorithm


Relevant readings/videos for this lecture

  • Chapter 7 of Laura Graesser and Wah Loon Keng, “Foundations of Deep Reinforcement Learning: Theory and Practice in Python”, Addison-Wesley Professional, 2019.

  • Chapter 8 of Nimish Sanghi, “Deep Reinforcement Learning with Python”, 2nd Edition, Apress, 2024.

  • Chapter 12 of Miguel Morales, “Grokking Deep Reinforcement Learning”, Manning, 2020.


Issues in policy gradient methods

  • Training agents with on-policy policy gradient algorithms can lead to performance collapse, where the agent’s performance suddenly degrades. (Why can this happen?)

  • Once performance collapses, recovery is difficult because the current (bad) policy generates low-quality trajectories, and these are exactly the trajectories used for further training.

  • Moreover, on-policy methods are sample-inefficient: they must discard old data and cannot fully reuse past experience, unlike off-policy algorithms that learn from replay buffers.


What is PPO (Proximal Policy Optimization)?

  • Proximal Policy Optimization (PPO), proposed by Schulman et al. (2017), is a family of policy optimization algorithms that addresses the issues mentioned above.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, “Proximal Policy Optimization Algorithms”, https://arxiv.org/abs/1707.06347, 2017

  • The main idea behind PPO is to introduce a surrogate objective that constrains policy updates so as to encourage stable, approximately monotonic policy improvement, thereby reducing the risk of performance collapse.

  • PPO can reuse the same batch of data for several gradient steps of the policy update (even as the policy changes slightly), improving sample efficiency while remaining essentially on-policy.

PPO leads to more stable and more sample-efficient training.
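The clipping mechanism behind these properties can be sketched numerically. The function below is an illustrative NumPy version of the per-sample clipped surrogate; the name `ppo_clip_objective` and the default ε = 0.2 are assumptions for the sketch, not taken from the slides:

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    new_logp/old_logp: log pi_theta(a|s) under the current policy and the
    data-collecting policy; advantage: an estimate of A(s, a).
    """
    ratio = np.exp(new_logp - old_logp)           # r_t(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)         # pessimistic (lower) bound

# Once the ratio leaves [1 - eps, 1 + eps] in the direction favored by
# the advantage, the objective stops rewarding further movement:
print(ppo_clip_objective(new_logp=0.0, old_logp=0.0, advantage=1.0))  # 1.0
print(ppo_clip_objective(new_logp=0.5, old_logp=0.0, advantage=1.0))  # 1.2 (clipped)
```

Taking the minimum of the clipped and unclipped terms makes the bound pessimistic in both directions: large ratios cannot inflate the objective, while ratios that would make a bad update look good are not clipped away.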


Performance collapse


Trust region policy optimization (1)

(See Appendix for the proof)
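For reference, the monotonic-improvement bound that trust-region methods build on, stated here in the notation of Schulman et al. (2015), which may differ slightly from the slides':

```latex
% Monotonic improvement bound (TRPO, Theorem 1):
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad
C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \quad
\epsilon = \max_{s,a} \bigl| A_{\pi}(s,a) \bigr|,
% where the surrogate L is the local linearization of the return:
L_{\pi}(\tilde{\pi}) = \eta(\pi)
  + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a).
```

Maximizing the right-hand side at each step guarantees non-decreasing true performance η, which motivates constraining the KL divergence between successive policies.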


Trust region policy optimization (2)

(The proof is left as an exercise.)


Trust region policy optimization (3)

Achiam, J., Held, D., Tamar, A., and Abbeel, P. “Constrained Policy Optimization.” 2017, https://arxiv.org/abs/1705.10528

Trust region policy optimization (4)


Trust region policy optimization (5)


Trust region policy optimization (6)


Trust region policy optimization (7)


Trust region policy optimization (8)


Proximal Policy Optimization (PPO) (1)

 


Proximal Policy Optimization (PPO) (2)


Proximal Policy Optimization (PPO) (3)

per-time-step expectation
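For reference, the per-time-step clipped surrogate objective in the standard notation of Schulman et al. (2017):

```latex
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t \Bigl[
      \min\Bigl( r_t(\theta)\, \hat{A}_t,\;
                 \mathrm{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\, \hat{A}_t
      \Bigr)
    \Bigr].
```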


Proximal Policy Optimization (PPO) (4)


Proximal Policy Optimization (PPO) (5)


Proximal Policy Optimization (PPO) (6)


Proximal Policy Optimization (PPO) (7)


PPO is on-policy (1)


PPO is on-policy (2)


Sample efficiency of PPO
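A sketch of why this reuse is sound, with all names and batch sizes illustrative: the log-probs under the collecting policy are stored once, and the ratio r = π_θ / π_θ_old re-weights the fixed batch on every epoch, so several gradient steps can be taken on data collected once.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_objective(new_logp, old_logp, adv, eps=0.2):
    """Mean PPO clipped surrogate over a batch."""
    ratio = np.exp(new_logp - old_logp)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

# One batch collected by pi_theta_old; its log-probs are stored at
# collection time and never recomputed under the old policy.
old_logp = rng.normal(size=64)
adv = rng.normal(size=64)

# A vanilla policy-gradient method takes one gradient step and discards
# the batch.  PPO reuses it for several epochs, because the ratio keeps
# the estimate valid while theta stays close to theta_old.
new_logp = old_logp.copy()
for epoch in range(4):                                  # 4 steps, one batch
    new_logp = new_logp + 0.01 * rng.normal(size=64)    # stand-in for an update
    obj = clipped_objective(new_logp, old_logp, adv)

# Before any update the ratio is exactly 1, so the surrogate equals the
# mean advantage:
assert np.isclose(clipped_objective(old_logp, old_logp, adv), adv.mean())
```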


PPO can be used for both continuous and discrete action spaces
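One reason PPO transfers across action types is that the surrogate only needs log π(a|s), however the policy is parameterized. A hedged sketch, assuming a softmax (categorical) policy for discrete actions and a diagonal Gaussian for continuous ones; the helper names are illustrative:

```python
import numpy as np

def discrete_logp(logits, action):
    """log pi(a|s) for a categorical policy (softmax over logits)."""
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return log_probs[action]

def gaussian_logp(mean, log_std, action):
    """log pi(a|s) for a diagonal-Gaussian policy over continuous actions."""
    std = np.exp(log_std)
    return (-0.5 * (((action - mean) / std) ** 2)
            - log_std - 0.5 * np.log(2 * np.pi)).sum()

# Either log-prob feeds the same PPO ratio r = exp(new_logp - old_logp):
print(discrete_logp(np.array([2.0, 1.0, 0.1]), action=0))
print(gaussian_logp(np.zeros(2), np.zeros(2), np.array([0.1, -0.2])))
```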


PPO Algorithm
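As an end-to-end illustration of the loop (collect a batch with π_old, store log-probs and advantages, then run several epochs of ascent on the clipped objective), here is a minimal runnable sketch on a toy two-armed bandit. The bandit, the finite-difference gradient, and all hyperparameters are illustrative choices for the sketch, not the lecture's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logp(theta, a):
    return np.log(softmax(theta)[a])

def clipped_obj(theta, acts, old_logp, adv, eps=0.2):
    new_logp = np.array([logp(theta, a) for a in acts])
    ratio = np.exp(new_logp - old_logp)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

def grad(f, theta, h=1e-5):
    """Finite-difference gradient (stand-in for autodiff)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = h
        g[i] = (f(theta + d) - f(theta - d)) / (2 * h)
    return g

theta = np.zeros(2)                       # policy parameters (logits)
rewards = np.array([1.0, 0.0])            # toy bandit: arm 0 is better

for iteration in range(30):
    # 1) collect a batch with the current (old) policy
    probs = softmax(theta)
    acts = rng.choice(2, size=64, p=probs)
    r = rewards[acts]
    adv = r - r.mean()                    # advantage with a mean baseline
    old_logp = np.array([logp(theta, a) for a in acts])
    # 2) several epochs of ascent on the clipped objective, same batch
    for epoch in range(4):
        theta = theta + 0.5 * grad(
            lambda th: clipped_obj(th, acts, old_logp, adv), theta)

print(softmax(theta))   # probability mass should concentrate on arm 0
```

The clipping shows up directly here: within each outer iteration, once the ratio for the better arm passes 1 + ε, the finite-difference gradient through the clipped term vanishes and the update stalls until a fresh batch is collected.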


+s, -s (1)

 


+s, -s (2)

-s:

    • Not truly sample-efficient: Although more efficient than REINFORCE, PPO discards data after a few updates and cannot learn from large replay buffers like off-policy methods.

    • No explicit long-term memory: Standard PPO only learns from the data in the current batch and does not explicitly store information from earlier policies or episodes. As a result, it struggles to exploit long-horizon structure unless additional mechanisms (e.g., recurrent networks or specialized memory architectures) are built in.

    • Clipping may under-optimize: the clipped objective can block useful updates when the advantage is large but the ratio slightly exceeds the clipping bounds.

    • No true theoretical guarantee: PPO is an empirical approximation to TRPO and lacks strong convergence theory, though it performs well in practice.
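The under-optimization drawback listed above can be seen numerically: for a positive advantage, the clipped objective is flat in the ratio beyond 1 + ε, so gradient ascent receives no further signal even though the advantage says to keep increasing π(a|s). A plain-Python sketch with ε = 0.2 assumed:

```python
def clip_obj(ratio, adv, eps=0.2):
    """Scalar PPO clipped surrogate for one (ratio, advantage) pair."""
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped * adv)

adv = 2.0   # large positive advantage
# Inside the band the objective still grows with the ratio...
print(clip_obj(1.10, adv))  # 2.2
print(clip_obj(1.19, adv))  # 2.38
# ...but just past 1 + eps it is flat: the derivative with respect to
# the ratio is zero, so the update direction is lost.
print(clip_obj(1.21, adv))  # 2.4
print(clip_obj(1.50, adv))  # 2.4
```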


Appendix

References (utilized for preparation of lecture notes or MATLAB code)

  • Laura Graesser and Wah Loon Keng, “Foundations of Deep Reinforcement Learning: Theory and Practice in Python”, Addison-Wesley Professional, 2019.

  • Nimish Sanghi, “Deep Reinforcement Learning with Python”, 2nd Edition, Apress, 2024.

  • Miguel Morales, “Grokking Deep Reinforcement Learning”, Manning, 2020.

  • https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf
