1 of 48

Enhancing Sampling Efficiency in RL

Mohamed Elnoor

Department of Electrical and Computer Engineering

University of Maryland

11/2/2023

2 of 48

Introduction

  • Exploring sampling efficiency in Reinforcement Learning (RL).
  • The role of Guided Policy Search in trajectory optimization.
  • Enhancing RL with differentiable programming for better sample efficiency.

3 of 48

Contents

  • Why Reinforcement Learning?
  • Reinforcement Learning Overview
  • Value Learning vs. Policy Learning
  • Gradient Use in Policy Learning
  • Sample Efficiency in RL
  • Guided Policy Search (S. Levine and V. Koltun)
  • Trust Region Policy Optimization (TRPO) (Schulman et al.)
  • Proximal Policy Optimization (PPO) (Schulman et al.)
  • Gradient Informed PPO (GI-PPO) (Son et al.)
  • Summary

4 of 48

Machine Learning Methods

https://www.mathworks.com/discovery/reinforcement-learning.html

5 of 48

Why Reinforcement Learning?

  • Navigating uncertainty and complexity in dynamic environments.
  • Adapting to new data and experiences.
  • From AlphaGo's triumph over human champions to robotic automation in warehouses.

6 of 48

7 of 48

Reinforcement Learning Overview

8 of 48

Reinforcement Learning Overview

  • Agent: The decision maker.
  • Environment: What the agent interacts with.
  • Action (A): What the agent can do.
  • State (S): Current situation of the agent.
  • Reward (R): Feedback from the environment.

9 of 48

Reinforcement Learning Overview

  • Value Function V(s)
    • Represents the expected cumulative reward starting from state s.

  • Quality Function Q(s, a)
    • Represents the expected cumulative reward for taking action a in state s.

  • Policy π(s)
    • Specifies the action an agent will take in a given state.

10 of 48

Reinforcement Learning Overview: RL as MDP

Markov Decision Process (MDP): An MDP defines the environment in which an RL agent operates, characterized by:

  • States S: All possible situations in the environment.
  • Actions A: All possible moves the agent can make.
  • Rewards R: Immediate reward received after taking an action in a state.
  • Transitions P: Probabilities of moving from one state to another.

The agent aims to determine the optimal policy that maximizes the expected cumulative reward:
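A hedged reconstruction of this objective (the formula did not survive extraction; a discount factor γ is assumed):

π* = argmax_π E_π [ Σ_{t=0}^{∞} γ^t R_t ]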

11 of 48

Value Learning vs. Policy Learning

  • Value Learning: find Q(s, a), then act greedily: a = argmax_a Q(s, a)
  • Policy Learning: find π(s) directly, then sample: a ~ π(s)

(Both action-selection rules are sketched in code after the trade-offs below.)

  • Trade-offs:
    • Value learning can reuse data but may need a model of the environment. Policy learning can be model-free but may require more samples.
    • Value learning is universal but may struggle in high-dimensional action spaces. Policy learning can be more direct and intuitive.
    • Both deal with exploration-exploitation trade-offs.
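A minimal Python sketch (not from the slides) contrasting the two action-selection rules; the q_values and policy_probs arrays are illustrative placeholders for learned quantities:

import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Value learning: act greedily with respect to a learned Q(s, a).
q_values = np.array([0.1, 0.5, 0.2, 0.2])       # Q(s, a) for one state s
greedy_action = int(np.argmax(q_values))         # a = argmax_a Q(s, a)

# Policy learning: sample from a learned policy π(a|s).
policy_probs = np.array([0.1, 0.6, 0.2, 0.1])    # π(a|s) for the same state
sampled_action = int(rng.choice(n_actions, p=policy_probs))   # a ~ π(s)

print(greedy_action, sampled_action)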

12 of 48

Value Learning

Wang, Y., Wang, L., & Dong, X. (2021). An intelligent TCP congestion control method based on deep Q network. Future Internet, 13(10), 261.

13 of 48

14 of 48

Policy Learning

http://introtodeeplearning.com/

15 of 48

Policy Learning

  • Training Algorithm:
    • Initialize the agent.
    • Run a policy until termination.
    • Record all states, actions, and rewards.
    • Decrease the probability of actions that resulted in low reward.
    • Increase the probability of actions that resulted in high reward.
  • Loss Function:

loss = −log P(a_t | s_t) · R_t

  • Gradient Descent Update:

w′ = w − ∇loss

w′ = w + ∇ log P(a_t | s_t) · R_t

(A minimal code sketch of this update follows below.)

http://introtodeeplearning.com/
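A minimal REINFORCE-style sketch of the loss and update above (an illustrative reconstruction, not the slide's code; assumes a small PyTorch softmax policy and a toy batch of recorded states, actions, and returns):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # logits over 2 actions
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# Toy episode: recorded states, actions, and returns R_t
states = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
returns = torch.randn(10)

log_probs = torch.log_softmax(policy(states), dim=-1)
log_p_taken = log_probs[torch.arange(10), actions]        # log P(a_t | s_t)

loss = -(log_p_taken * returns).mean()                     # loss = −log P(a_t|s_t) · R_t

optimizer.zero_grad()
loss.backward()
optimizer.step()                                            # w' = w − lr · ∇loss (learning rate added here)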

16 of 48

Use of Gradient in Policy Learning

  • Gradient Descent - Steepest descent method to find local minima or maxima.
  • Policy Gradient: Adjusting policy parameters in the direction that maximizes expected rewards.
  • Challenges on the Road:
    • Vanishing Gradients: When updates get too small, slowing learning.
    • Exploding Gradients: When updates become too large, causing instability.

17 of 48

18 of 48

Sample Efficiency in Policy Learning

Motivation:

  • In policy learning, obtaining samples (experiences) can be expensive.
  • Each sample can involve real-world interactions or costly simulations.
  • Efficiently using these samples is crucial for faster and more stable learning.

Why It Matters:

  • Reduces the training time.
  • Conserves resources and costs.
  • Achieves better convergence in fewer episodes.

19 of 48

20 of 48

Exploration-Exploitation Dilemma

The Dilemma:

  • Exploitation: Stick to known strategies to get expected rewards.
  • Exploration: Try new strategies in hope of discovering better rewards.

Risks of Pure Exploitation:

  • Can lead to local optima in the search space.
  • Might miss out on more rewarding trajectories or strategies.

21 of 48

Guided Policy Search

  • Proposed Solution:
    • Trajectory optimization to guide policy search.
    • Use of DDP for "guiding samples".
    • Incorporating samples into policy search with a unique estimator.
  • Main Contribution:
    • A guided policy search method, leveraging trajectory optimization.
    • Tested on various tasks showing adaptability and high performance.

22 of 48

Policy Optimization: A Refresher

Policy Gradient: Optimize E [J(π)]

▶ Parameters: θ

▶ Gradient Estimate:
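The estimate itself did not extract; the standard likelihood-ratio form used in this setting (a hedged reconstruction) is

∇_θ E[J(θ)] ≈ (1/m) Σ_{i=1}^{m} ∇_θ log π_θ(ξ_i) · r(ξ_i),   with trajectories ξ_i sampled from π_θ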

23 of 48

Concept and Usage of Importance Sampling

Importance Sampling:

  • A statistical technique for estimating properties of one distribution using samples drawn from another.
  • Expectation (sketched numerically below):

E_p[f(x)] = E_q[ f(x) · p(x) / q(x) ]

Usage:

  • Useful when p is hard to sample from, but q isn't.
  • Often p might be complex or infeasible to sample from directly.
  • With importance sampling, we instead use q, which is more tractable to sample from.
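A minimal numerical sketch of the identity above (illustrative only; the Gaussians p and q are arbitrary choices for the example, not anything from the papers discussed here):

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), used for both p and q below.
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                                  # quantity whose expectation we want under p
x = rng.normal(loc=0.0, scale=2.0, size=100_000)      # samples from the proposal q = N(0, 2^2)
w = gauss_pdf(x, 3.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance weights p(x)/q(x), with p = N(3, 1)

is_estimate = np.mean(f(x) * w)                       # estimates E_p[f(x)]
print(is_estimate)                                    # ≈ 3^2 + 1 = 10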

24 of 48

Variance Reduction:

  • One of the challenges with importance sampling is high variance.
  • Techniques exist to reduce this variance, especially when dealing with finite action distributions.

25 of 48

Constructing Guiding Distributions

  • An effective guiding distribution covers high-reward regions while avoiding large (ψ) weights. The KL-divergence is introduced with respect to high-reward states.

26 of 48

Adaptive Guiding Distributions

The distribution in the preceding section captures high-reward regions, but does not consider the current policy. It can be adapted to the policy by sampling from an I-projection of p(ξ) ∝ exp(r(ξ)/η), which is the optimal distribution for estimating E_p(ξ)[r(ξ)].

In practice this is done by augmenting the DDP reward with the log-probability of the current policy:

r̃(x, u) = r(x, u) + log π_θ(u | x)

The resulting distribution is then an approximate I-projection of

p(ξ) ∝ exp(r(ξ))

27 of 48

Incorporating Guiding Samples

  • Guiding samples assist policy search by incorporating DDP solutions.
  • These solutions can be initialized from human demonstrations or from an off-the-shelf search algorithm.
    • Guiding samples are adaptive when used.
    • Each iteration of the policy search deviates from previous DDP solutions.

28 of 48

29 of 48

Experiments

30 of 48

Trust Region Algorithms

31 of 48

Trust Region Policy Optimization (TRPO)

  • Shortcomings of Traditional Policy Gradient Methods:
    • Risk of large policy updates: Leads to destabilization and potential divergence.
    • Inconsistency: The same action can have very different implications based on how much it deviates from the previous policy.
  • Trust Region:
    • Ensures updates to the policy are bounded, so they don't deviate too much.
    • Uses the natural policy gradient and constrains the step size in policy space to ensure safe updates.
  • Why TRPO?
    • Stability: By avoiding drastic changes, TRPO avoids the pitfalls of big leaps that might collapse performance.
    • Efficiency: Converges to a solution faster than traditional policy gradient methods by ensuring each step is productive.
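For reference, the trust-region update of Schulman et al. (2015), added here because the slide's own formula was not recoverable from the extraction:

maximize_θ   E_t [ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ]
subject to   E_t [ KL( π_θ_old(·|s_t) ‖ π_θ(·|s_t) ) ] ≤ δ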

32 of 48

TRPO

Advantage Function
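The definition referenced here (reconstructed; the slide's equation did not extract):

A^π(s, a) = Q^π(s, a) − V^π(s)

i.e., how much better taking action a in state s is than the policy's average behaviour from s.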

33 of 48

TRPO

34 of 48

Proximal Policy Optimization (PPO)

  • TRPO's Cons:
    • Computationally expensive due to second-order optimization.
    • Implementation complexity: Requires careful tuning and is hard to implement in practice.
  • PPO's Simplification:
    • Introduces a "clipped" objective function (written out below), ensuring the policy doesn't change too drastically.
    • Avoids the complicated adaptive step size and the use of higher-order derivatives.
  • Why PPO?
    • Provides many of TRPO's benefits but is simpler and more scalable.
    • Due to its efficacy and simplicity, it's become a standard in many deep RL applications.
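The clipped objective from Schulman et al. (2017), written out here because the slide's formula did not extract:

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t ,  clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],   where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)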

35 of 48

TRPO vs PPO

TRPO

PPO

36 of 48

PPO algorithm
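The algorithm box did not survive extraction; below is a minimal, hedged PyTorch sketch of the core clipped-surrogate loss only (illustrative names; the value-function and entropy terms of the full algorithm are omitted):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # L^CLIP = -E[min(r·A, clip(r, 1-eps, 1+eps)·A)], negated because we minimize.
    ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for a collected batch
new_lp = torch.randn(64, requires_grad=True)
old_lp, adv = torch.randn(64), torch.randn(64)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()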

37 of 48

38 of 48

GI-PPO

  • Combines PPO sampling with analytical gradients.
  • Motivated by: high variance, bias, cost of LR+RP, imbalance of sampling vs gradients, handling biased gradients.
  • Introduces α-policy to adaptively integrate sampling and gradients.

39 of 48

Types of gradients

Analytical gradients:

  • Derived from differentiable environmental dynamics.
  • Directly represent action-outcome relationships; utilized in differentiable programming tools.

Likelihood ratio (LR) gradient:

  • Uses the log-derivative. Unbiased but can have high variance.
  • Does not require knowledge of environment dynamics.

Reparameterization (RP) gradient:

  • Uses the reparameterization trick and requires analytical gradients of the dynamics.
  • Typically lower variance than LR gradients, but the variance can explode in chaotic or long-horizon dynamics.
  • Subject to potential empirical bias.
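A minimal sketch (not from the paper) contrasting the two estimators of ∇_μ E_{a~N(μ,1)}[f(a)] on a toy differentiable objective f; all names are illustrative:

import torch

f = lambda a: (a - 2.0) ** 2                     # toy differentiable objective standing in for the environment
mu = torch.tensor(0.0, requires_grad=True)
eps = torch.randn(100_000)
a = mu + eps                                     # reparameterization: a = g_mu(eps), sigma fixed to 1

# RP gradient: differentiate through f(a) directly (needs df/da).
rp_grad = torch.autograd.grad(f(a).mean(), mu, retain_graph=True)[0]

# LR (score-function) gradient: E[f(a) · ∇_mu log N(a; mu, 1)], needs no df/da.
log_prob = -0.5 * (a.detach() - mu) ** 2         # log-density up to constants
lr_grad = torch.autograd.grad((f(a.detach()) * log_prob).mean(), mu)[0]

print(rp_grad.item(), lr_grad.item())            # both ≈ d/dμ E[(a-2)^2] = 2(μ-2) = -4; the LR estimate is noisier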

40 of 48

Limitations of Prior Work

  • LR gradients: high variance, control variates introduce bias.
  • RP gradients: can still explode, empirical bias.
  • LR+RP: interpolating requires estimating both, costly.
  • PPO: unclear how to leverage RP gradients.
  • Gradient-only: no stability of sampling. Sampling-only: no gradient info.

41 of 48

Sampling in GI-PPO

Differentiable environment provides analytical gradients for observation and reward w.r.t. previous action:
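Presumably (the expressions did not extract) these are the Jacobians of the current observation and reward with respect to the previous action:

∂o_t / ∂a_{t−1}   and   ∂r_t / ∂a_{t−1}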

Used to compute gradient of advantage function.

Policy Update

RP Gradient:

  • Defines πθ through a bijection gθ, mapping noise to action.
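In symbols (an assumed standard form; the slide's equation is missing), the action is generated as

a_t = g_θ(ε_t; s_t),   ε_t ~ N(0, I)

so gradients can flow from the differentiable reward back to θ through a_t.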

42 of 48

Adaptive α

  • The α-policy bridges the gap between PPO’s original policy and the target policy for RP gradient.
  • Adjusting α allows us to strike a balance between exploiting analytical gradients and following PPO’s policy update rules.

Adaptive Adjustments of α:

1. Variance: Adjust α to ensure stable policy updates, considering eigenvalues of the gradient.

2. Bias: Decrease α if analytical gradients seem biased.

3. Out-of-range-ratio: Ensure PPO’s restrictions are upheld.

PPO Update with α:

43 of 48

44 of 48

Takeaways

  • Introduced analytical gradients into the PPO framework for enhanced policy learning.
  • The α-policy adapts the current policy, enabling estimation of the variance and bias of the analytical gradients.
  • Developed adaptive α-algorithm to adjust the influence of analytical gradients.
  • Outperformed baseline PPO in diverse RL environments.
  • Demonstrated superiority over methods like interpolated gradient due to stable PPO learning.

45 of 48

Limitations & Future Work

  • Approximation Challenge: Achieving true α-policy approximation remains elusive even with large α values.
  • PPO Dependence: GI-PPO's close ties with PPO restrict the full potential of analytical gradients.
  • Computational Overhead: Use of backpropagation for analytical gradients demands increased training time.
  • Fixed PPO Boundaries: PPO's set clipping range might compromise the convergence rate in certain scenarios.

46 of 48

Summary

  • Overview of Reinforcement Learning
  • Value Learning vs Policy Learning
  • Sample Efficiency in RL
  • Exploitation-Exploration Dilemma
  • Guiding the policy search
  • Trust Region Algorithms
  • Gradient Informed PPO

47 of 48

References

Levine, S. and Koltun, V., 2013, May. Guided policy search. In International conference on machine learning (pp. 1-9). PMLR.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., 2015, June. Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Son, S., Zheng, L.Y., Sullivan, R., Qiao, Y.L. and Lin, M.C., 2023. Gradient informed proximal policy optimization. In Advances in Neural Information Processing Systems.

48 of 48

Questions