1 of 22

MLC RL Reading Group

Proximal Policy Optimization Algorithms

2 of 22

Reviewer

Perusha

3 of 22

Summary - Start with PG methods

  • Policy gradient methods are briefly described as methods where a policy gradient estimator is defined and the objective to optimise is derived from that gradient. Training is on-policy: interact with the environment, generate samples, update the policy and discard the old policy's data; repeat.

Below: the most common gradient estimator used in PG methods, and the corresponding objective function for optimisation.
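From the paper (Schulman et al., 2017, eqs. 1-2), with $\hat{A}_t$ an estimator of the advantage function at timestep $t$:

```latex
\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],
\qquad
L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]
```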

4 of 22

Summary - Motivation for PPO

  • A big problem with PG methods is that updates in policy space are only controlled indirectly. There is a distinction between parameter space and policy space: we control the size of parameter updates (via the learning rate 𝜶) but not the corresponding change in policy space, which can lead to performance collapse.
  • Performance collapse: a bad update can push the latest iteration of the policy into a bad region of policy space. Given that the new policy generates data for the next update, we get another bad update and this could spiral.
  • This is one of the main motivations for TRPO, PPO and similar algorithms:
    • Focus on controlling how much the policy is updated, constraining the amount of update using various mechanisms.
    • Re-using samples to improve sample efficiency
    • ASIDE: But does this always make sense? What if we want to take a big step in a positive direction? It’s a trade-off… but in exchange we get (approximate) monotonic improvement guarantees on the reward!!

5 of 22

Summary - TRPO

  • In TRPO a new (surrogate) objective is derived and the policy update is constrained, via the KL divergence to the previous policy, to a region close to that policy (see the constrained problem below)
  • Problems with TRPO: in the penalised form the coefficient 𝛽 is fixed and hard to choose, and the constrained form is complicated to optimise (second-order / conjugate gradient)...
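For reference, TRPO's constrained surrogate problem (following the PPO paper's notation) is roughly:

```latex
\max_{\theta}\;\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\; \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta
```

The penalised variant instead subtracts 𝛽 times the KL term from the surrogate, which is where the fixed-𝛽 problem above comes from.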

http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

Quick explanation of TRPO also here: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12

6 of 22

Summary - PPO Adaptive Penalty co-eff

  • Similar to the penalised objective described for TRPO, but 𝛽 is adapted automatically
  • 𝛽 is updated as shown below: if the policies move too far apart we increase the penalty, and vice versa.
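A sketch of the penalised objective and the 𝛽 update rule from the paper, where $d_{\text{targ}}$ is the target KL divergence:

```latex
L^{\mathrm{KLPEN}}(\theta) =
\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t
 \;-\; \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\; \pi_\theta(\cdot \mid s_t)\right]\right]
```

After each policy update, compute $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$; if $d < d_{\text{targ}}/1.5$ halve $\beta$, and if $d > 1.5\, d_{\text{targ}}$ double $\beta$.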

7 of 22

Summary - PPO Clipped

  • The PPO clipped approach uses a simpler mechanism to constrain the size of the policy update:
  • Maximising the original objective L^CPI with the probability ratio r unconstrained would lead to excessively large policy updates.
  • In L^CLIP, r is clipped to the range [1−ε, 1+ε] and the objective is the expectation of the minimum of the original L^CPI term and its clipped version: a lower (pessimistic) bound on the original objective, and another way to control the size of the updates we take (see below).
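The clipped objective from the paper, with $r_t(\theta)$ the probability ratio and $\epsilon$ the clipping hyperparameter (e.g. 0.2):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
\hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]
```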

8 of 22

Summary - PPO Clipped

http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

9 of 22

Summary - Implementation

  • The authors show how easy it is to replace L^PG with L^CLIP in an existing policy gradient implementation (a sketch follows below)
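A minimal sketch of that swap, assuming a PyTorch setup in which per-timestep log-probabilities and advantage estimates are already available (function and variable names here are hypothetical, not from the paper's code):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP: E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ]."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta)
    unclipped = ratio * advantages                             # the L^CPI term
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negated because optimisers minimise, while the paper maximises L^CLIP.
    return -torch.min(unclipped, clipped).mean()
```

In a vanilla PG implementation the loss would instead be `-(new_log_probs * advantages).mean()`; only the loss function changes, plus the bookkeeping needed to keep the old policy's log-probabilities across several epochs of minibatch updates.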

10 of 22

Summary - Experiments - First, comparing the clipped surrogate objective to other versions of the surrogate objective

11 of 22

Summary - Experiments - Next, comparing PPO with clipping to other popular algorithms

12 of 22

Summary - Experiments - Humanoid high-dimensional continuous control tasks

13 of 22

Review Questions:

  • Originality: considering TRPO and its use of the KL divergence, this was an unexpected twist towards a much simpler solution!
  • Quality: comparisons with other algorithms are clearly made
  • Clarity: well written, concise and to the point, but it definitely dives into the deep end; readers are expected to know the prior work well enough to follow the paper.
  • Significance: yes, very hard to argue with this!!

  • Score: 10 Confidence Score: 3

14 of 22

Resources and References

  • Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1707.06347.
  • DRL course (Levine): http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
  • Intuitive descriptions of PPO, TRPO, etc.: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
  • OpenAI Baselines: PPO: https://openai.com/blog/openai-baselines-ppo/

15 of 22

Researcher (Prashant) - Bot for Gardenscapes

Applications in a puzzle game:

Gardenscapes

  • Advantage functions which can be used for different levels
  • Easier to code and tune compared to TRPO

16 of 22

Archaeologist

Past Papers:

  • Richard S. Sutton, David McAllester, et al.: Policy Gradient Methods for Reinforcement Learning with Function Approximation
  • John Schulman, Sergey Levine, Philipp Moritz, et al.: Trust Region Policy Optimization

17 of 22

Policy Gradient Methods

  • Vanilla Policy Gradient (VPG) suffers from high variance in its gradient estimates
  • Because of this high variance, subsequent updates are not guaranteed to keep improving the policy
  • Many later methods build on VPG and address ways to reduce variance and increase robustness and stability (a minimal sketch of the VPG loss follows below)
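As a point of reference, a minimal sketch of the vanilla PG (REINFORCE-style) loss, assuming PyTorch tensors of per-step log-probabilities and Monte Carlo returns (names are hypothetical):

```python
import torch

def vanilla_pg_loss(log_probs, returns):
    # Score-function estimator: unbiased, but the Monte Carlo returns make the
    # gradient estimate high-variance, which is the issue noted above.
    return -(log_probs * returns).mean()
```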

18 of 22

Trust Region Policy Optimization

  • TRPO addresses this problem by defining a TRUST REGION for the policy updates.
  • By bounding how far each new policy can move from the previous one, it ensures (approximately) monotonic policy improvement
  • TRPO defines this bound in terms of the KL divergence between the two policy distributions (old and new).
  • Keeping an upper limit on the KL divergence ensures that the policy doesn't deviate too much, which reduces the variance of the updates

19 of 22

Proximal Policy Optimization

  • PPO builds on the same fundamental idea as TRPO of defining a trust region
  • TRPO is complicated to implement in practice, so PPO reduces that complexity while improving sample efficiency, robustness and stability
  • PPO creates the region by clipping the probability ratio of the new and old policies to a range (controlled by a hyperparameter ε)
  • PPO also overcomes an implementation hurdle of TRPO: its objective is compatible with plain SGD-style optimisers, whereas TRPO is optimised using the conjugate gradient method
  • One modification introduced in the paper when applying PPO to (shared) network architectures is an entropy component to encourage exploration (see the combined objective below)
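The combined per-timestep objective from the paper for a shared policy/value network, where $c_1, c_2$ are coefficients, $L_t^{\mathrm{VF}}$ is a squared-error value loss and $S$ denotes an entropy bonus:

```latex
L_t^{\mathrm{CLIP}+\mathrm{VF}+S}(\theta) =
\hat{\mathbb{E}}_t\!\left[\, L_t^{\mathrm{CLIP}}(\theta) \;-\; c_1\, L_t^{\mathrm{VF}}(\theta) \;+\; c_2\, S[\pi_\theta](s_t) \,\right]
```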

20 of 22

Comparison of PPO with other methods

21 of 22

Archaeologist

Future Papers:

  • Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt : Revisiting Design Choices in Proximal Policy Optimization
  • Oriol Vinyals, Igor Babuschkin, David Silver: Grandmaster level in StarCraft II using multi-agent reinforcement learning
  • Christopher Berner, Greg Brockman, Brooke Chan: Dota 2 with Large Scale Deep Reinforcement Learning

22 of 22

Hacker