1 of 22

MLC RL Reading Group

Proximal Policy Optimization Algorithms

2 of 22

Reviewer

Perusha

3 of 22

Summary - Start with PG methods

  • Policy gradient methods are briefly described as methods where a policy gradient estimator is defined and the objective to optimise is derived from that gradient. Training is on-policy: interact with the environment, generate samples, update the policy and discard the old policy's data; repeat.

Below: the most common gradient estimator used in PG methods, and the corresponding objective function for optimisation.
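From the paper (Schulman et al., 2017, eqs. 1-2), with $\hat{A}_t$ an estimator of the advantage function at timestep $t$:

```latex
\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],
\qquad
L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]
```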

4 of 22

Summary - Motivation for PPO

  • A big problem with PG methods is that updates in policy space are only controlled indirectly. There is a distinction between parameter space and policy space: we control the size of parameter updates (via the learning rate 𝜶) but not the corresponding change in policy space, which can lead to performance collapse.
  • Performance collapse: a bad update can push the latest iteration of the policy into a bad region of policy space. Given that the new policy generates data for the next update, we get another bad update and this could spiral.
  • This is one of the main motivations for TRPO, PPO and similar algorithms:
    • Focus on controlling how much the policy is updated, constraining the amount of update using various mechanisms.
    • Re-using samples to improve sample efficiency
    • ASIDE: But does this always make sense? What if we want to take a big step in a positive direction? It’s a trade-off… but in exchange we get (approximate) monotonic improvement guarantees on the reward!!

5 of 22

Summary - TRPO

  • In TRPO a new (surrogate) objective is derived and the policy update is constrained, via the KL divergence to the previous policy, to a region close to that policy (see the constrained problem below)
  • Problems with TRPO: in the penalised form the coefficient 𝛽 is fixed and hard to choose, and the constrained form is complicated to optimise (second-order / conjugate gradient)...
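For reference, TRPO's constrained surrogate problem (following the PPO paper's notation) is roughly:

```latex
\max_{\theta}\;\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\; \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta
```

The penalised variant instead subtracts 𝛽 times the KL term from the surrogate, which is where the fixed-𝛽 problem above comes from.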

http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

Quick explanation of TRPO also here: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12

6 of 22

Summary - PPO Adaptive Penalty co-eff

  • Similar to the penalised objective described for TRPO, but 𝛽 is adapted automatically
  • 𝛽 is updated as shown below: if the policies move too far apart we increase the penalty, and vice versa.
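A sketch of the penalised objective and the 𝛽 update rule from the paper, where $d_{\text{targ}}$ is the target KL divergence:

```latex
L^{\mathrm{KLPEN}}(\theta) =
\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t
 \;-\; \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\; \pi_\theta(\cdot \mid s_t)\right]\right]
```

After each policy update, compute $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$; if $d < d_{\text{targ}}/1.5$ halve $\beta$, and if $d > 1.5\, d_{\text{targ}}$ double $\beta$.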

7 of 22

Summary - PPO Clipped

  • The PPO clipped approach uses a simpler mechanism to constrain the size of the policy update:
  • Maximising the original objective L^CPI with the probability ratio r unconstrained would lead to excessively large policy updates.
  • In L^CLIP, r is clipped to the range [1−ε, 1+ε] and the objective is the expectation of the minimum of the original L^CPI term and its clipped version: a lower (pessimistic) bound on the original objective, and another way to control the size of the updates we take (see below).
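The clipped objective from the paper, with $r_t(\theta)$ the probability ratio and $\epsilon$ the clipping hyperparameter (e.g. 0.2):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
\hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]
```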

8 of 22

Summary - PPO Clipped

http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

9 of 22

Summary - Implementation

  • The authors show how easy it is to replace L^PG with L^CLIP in an existing policy gradient implementation (a sketch follows below)
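A minimal sketch of that swap, assuming a PyTorch setup in which per-timestep log-probabilities and advantage estimates are already available (function and variable names here are hypothetical, not from the paper's code):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP: E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ]."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta)
    unclipped = ratio * advantages                             # the L^CPI term
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negated because optimisers minimise, while the paper maximises L^CLIP.
    return -torch.min(unclipped, clipped).mean()
```

In a vanilla PG implementation the loss would instead be `-(new_log_probs * advantages).mean()`; only the loss function changes, plus the bookkeeping needed to keep the old policy's log-probabilities across several epochs of minibatch updates.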

10 of 22

Summary - Experiments - First, comparing the clipped surrogate objective to other versions of the surrogate objective

11 of 22

Summary - Experiments - Next, comparing PPO with clipping to other popular algorithms

12 of 22

Summary - Experiments - Humanoid high-dimensional continuous control tasks

13 of 22

Review Questions:

  • Originality: considering TRPO and its use of the KL divergence, this was an unexpected twist towards a much simpler solution!
  • Quality: comparisons with other algorithms are clearly made
  • Clarity: well written, concise and to the point, but it definitely dives into the deep end; readers are expected to know the prior work well enough to follow the paper.
  • Significance: yes, very hard to argue with this!!

  • Score: 10 Confidence Score: 3

14 of 22

Resources and References

  • Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1707.06347.
  • DRL course (Levine): http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
  • Intuitive descriptions of PPO, TRPO, etc.: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
  • OpenAI Baselines: PPO: https://openai.com/blog/openai-baselines-ppo/

15 of 22

Researcher (Prashant) - Bot for Gardenscapes

Applications in a puzzle game:

Gardenscapes

  • Advantage functions which can be used for different levels
  • Easier to code and tune compared to TRPO

16 of 22

Archaeologist

Past Papers:

  • Richard S. Sutton, David McAllester, et al.: Policy Gradient Methods for Reinforcement Learning with Function Approximation
  • John Schulman, Sergey Levine, Philipp Moritz, et al.: Trust Region Policy Optimization

17 of 22

Policy Gradient Methods

  • Vanilla Policy Gradient (VPG) suffers from high variance in its gradient estimates
  • Because of this high variance, subsequent updates are not guaranteed to keep improving the policy
  • Many later methods build on VPG and address ways to reduce variance and increase robustness and stability (a minimal sketch of the VPG loss follows below)
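As a point of reference, a minimal sketch of the vanilla PG (REINFORCE-style) loss, assuming PyTorch tensors of per-step log-probabilities and Monte Carlo returns (names are hypothetical):

```python
import torch

def vanilla_pg_loss(log_probs, returns):
    # Score-function estimator: unbiased, but the Monte Carlo returns make the
    # gradient estimate high-variance, which is the issue noted above.
    return -(log_probs * returns).mean()
```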

18 of 22

Trust Region Policy Optimization

  • TRPO addresses this problem by defining a TRUST REGION for the policy updates.
  • By bounding how far each new policy can move from the previous one, it ensures (approximately) monotonic policy improvement
  • TRPO defines this bound in terms of the KL divergence between the two policy distributions (old and new).
  • Keeping an upper limit on the KL divergence ensures that the policy doesn't deviate too much, which reduces the variance of the updates

19 of 22

Proximal Policy Optimization

  • PPO builds on the same fundamental idea as TRPO of defining a trust region
  • TRPO is complicated to implement in practice, so PPO reduces that complexity while improving sample efficiency, robustness and stability
  • PPO creates the region by clipping the probability ratio of the new and old policies to a range (controlled by a hyperparameter ε)
  • PPO also overcomes an implementation hurdle of TRPO: its objective is compatible with plain SGD-style optimisers, whereas TRPO is optimised using the conjugate gradient method
  • One modification introduced in the paper when applying PPO to (shared) network architectures is an entropy component to encourage exploration (see the combined objective below)
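The combined per-timestep objective from the paper for a shared policy/value network, where $c_1, c_2$ are coefficients, $L_t^{\mathrm{VF}}$ is a squared-error value loss and $S$ denotes an entropy bonus:

```latex
L_t^{\mathrm{CLIP}+\mathrm{VF}+S}(\theta) =
\hat{\mathbb{E}}_t\!\left[\, L_t^{\mathrm{CLIP}}(\theta) \;-\; c_1\, L_t^{\mathrm{VF}}(\theta) \;+\; c_2\, S[\pi_\theta](s_t) \,\right]
```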

20 of 22

Comparison of PPO with other methods

21 of 22

Archaeologist

Future Papers:

  • Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt : Revisiting Design Choices in Proximal Policy Optimization
  • Oriol Vinyals, Igor Babuschkin, David Silver: Grandmaster level in StarCraft II using multi-agent reinforcement learning
  • Christopher Berner, Greg Brockman, Brooke Chan: Dota 2 with Large Scale Deep Reinforcement Learning

22 of 22

Hacker