1 of 48

Enhancing Sampling Efficiency in RL

Mohamed Elnoor

Department of Electrical and Computer Engineering

University of Maryland

11/2/2023

2 of 48

Introduction

  • Exploring sampling efficiency in Reinforcement Learning (RL).
  • The role of Guided Policy Search in trajectory optimization.
  • Enhancing RL with differentiable programming for better sample efficiency.

3 of 48

Contents

  • Why Reinforcement Learning?
  • Reinforcement Learning Overview
  • Value Learning vs. Policy Learning
  • Gradient Use in Policy Learning
  • Sample Efficiency in RL
  • Guided Policy Search (S. Levine and V. Koltun)
  • Trust Region Policy Optimization (TRPO) (Schulman et al.)
  • Proximal Policy Optimization (PPO) (Schulman et al.)
  • Gradient Informed PPO (GI-PPO) (Son et al.)
  • Summary

4 of 48

Machine Learning Methods

https://www.mathworks.com/discovery/reinforcement-learning.html

5 of 48

Why Reinforcement Learning?

  • Navigating uncertainty and complexity in dynamic environments.
  • Adapting to new data and experiences.
  • From AlphaGo's triumph over human champions to robotic automation in warehouses.

6 of 48

7 of 48

Reinforcement Learning Overview

8 of 48

Reinforcement Learning Overview

  • Agent: The decision maker.
  • Environment: What the agent interacts with.
  • Action (A): What the agent can do.
  • State (S): Current situation of the agent.
  • Reward (R): Feedback from the environment.

9 of 48

Reinforcement Learning Overview

  • Value Function V(s)
    • Represents the expected cumulative reward starting from state s.

  • Quality Function Q(s, a)
    • Represents the expected cumulative reward for taking action a in state s.

  • Policy π(s)
    • Specifies the action an agent will take in a given state.

10 of 48

Reinforcement Learning Overview: RL as MDP

Markov Decision Process (MDP): An MDP defines the environment in which an RL agent operates, characterized by:

  • States S: All possible situations in the environment.
  • Actions A: All possible moves the agent can make.
  • Rewards R: Immediate reward received after taking an action in a state.
  • Transitions P: Probabilities of moving from one state to another.

The agent aims to determine the optimal policy that maximizes the expected cumulative reward:
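A hedged reconstruction of this objective (the formula did not survive extraction; a discount factor γ is assumed):

π* = argmax_π E_π [ Σ_{t=0}^{∞} γ^t R_t ]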

11 of 48

Value Learning vs. Policy Learning

  • Value Learning: find Q(s, a), then act greedily: a = argmax_a Q(s, a)
  • Policy Learning: find π(s) directly, then sample: a ~ π(s)

(Both action-selection rules are sketched in code after the trade-offs below.)

  • Trade-offs:
    • Value learning can reuse data but may need a model of the environment. Policy learning can be model-free but may require more samples.
    • Value learning is universal but may struggle in high-dimensional action spaces. Policy learning can be more direct and intuitive.
    • Both deal with exploration-exploitation trade-offs.
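A minimal Python sketch (not from the slides) contrasting the two action-selection rules; the q_values and policy_probs arrays are illustrative placeholders for learned quantities:

import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Value learning: act greedily with respect to a learned Q(s, a).
q_values = np.array([0.1, 0.5, 0.2, 0.2])       # Q(s, a) for one state s
greedy_action = int(np.argmax(q_values))         # a = argmax_a Q(s, a)

# Policy learning: sample from a learned policy π(a|s).
policy_probs = np.array([0.1, 0.6, 0.2, 0.1])    # π(a|s) for the same state
sampled_action = int(rng.choice(n_actions, p=policy_probs))   # a ~ π(s)

print(greedy_action, sampled_action)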

12 of 48

Value Learning

Wang, Y., Wang, L., & Dong, X. (2021). An intelligent TCP congestion control method based on deep Q network. Future Internet, 13(10), 261.

13 of 48

14 of 48

Policy Learning

http://introtodeeplearning.com/

15 of 48

Policy Learning

  • Training Algorithm:
    • Initialize the agent.
    • Run a policy until termination.
    • Record all states, actions, and rewards.
    • Decrease the probability of actions that resulted in low reward.
    • Increase the probability of actions that resulted in high reward.
  • Loss Function:

loss = −log P(a_t | s_t) · R_t

  • Gradient Descent Update:

w′ = w − ∇loss

w′ = w + ∇ log P(a_t | s_t) · R_t

(A minimal code sketch of this update follows below.)

http://introtodeeplearning.com/
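A minimal REINFORCE-style sketch of the loss and update above (an illustrative reconstruction, not the slide's code; assumes a small PyTorch softmax policy and a toy batch of recorded states, actions, and returns):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # logits over 2 actions
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# Toy episode: recorded states, actions, and returns R_t
states = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
returns = torch.randn(10)

log_probs = torch.log_softmax(policy(states), dim=-1)
log_p_taken = log_probs[torch.arange(10), actions]        # log P(a_t | s_t)

loss = -(log_p_taken * returns).mean()                     # loss = −log P(a_t|s_t) · R_t

optimizer.zero_grad()
loss.backward()
optimizer.step()                                            # w' = w − lr · ∇loss (learning rate added here)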

16 of 48

Use of Gradient in Policy Learning

  • Gradient Descent - Steepest descent method to find local minima or maxima.
  • Policy Gradient: Adjusting policy parameters in the direction that maximizes expected rewards.
  • Challenges on the Road:
    • Vanishing Gradients: When updates get too small, slowing learning.
    • Exploding Gradients: When updates become too large, causing instability.

17 of 48

18 of 48

Sample Efficiency in Policy Learning

Motivation:

  • In policy learning, obtaining samples (experiences) can be expensive.
  • Each sample can involve real-world interactions or costly simulations.
  • Efficiently using these samples is crucial for faster and more stable learning.

Why It Matters:

  • Reduces the training time.
  • Conserves resources and costs.
  • Achieves better convergence in fewer episodes.

19 of 48

20 of 48

Exploration-Exploitation Dilemma

The Dilemma:

  • Exploitation: Stick to known strategies to get expected rewards.
  • Exploration: Try new strategies in hope of discovering better rewards.

Risks of Pure Exploitation:

  • Can lead to local optima in the search space.
  • Might miss out on more rewarding trajectories or strategies.

21 of 48

Guided Policy Search

  • Proposed Solution:
    • Trajectory optimization to guide policy search.
    • Use of DDP for "guiding samples".
    • Incorporating samples into policy search with a unique estimator.
  • Main Contribution:
    • A guided policy search method, leveraging trajectory optimization.
    • Tested on various tasks showing adaptability and high performance.

22 of 48

Policy Optimization: A Refresher

Policy Gradient: Optimize E [J(π)]

▶ Parameters: θ

▶ Gradient Estimate:
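The estimate itself did not extract; the standard likelihood-ratio form used in this setting (a hedged reconstruction) is

∇_θ E[J(θ)] ≈ (1/m) Σ_{i=1}^{m} ∇_θ log π_θ(ξ_i) · r(ξ_i),   with trajectories ξ_i sampled from π_θ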

23 of 48

Concept and Usage of Importance Sampling

Importance Sampling:

  • A statistical technique for estimating properties of one distribution using samples drawn from another.
  • Expectation (sketched numerically below):

E_p[f(x)] = E_q[ f(x) · p(x) / q(x) ]

Usage:

  • Useful when p is hard to sample from, but q isn't.
  • Often p might be complex or infeasible to sample from directly.
  • With importance sampling, we instead use q, which is more tractable to sample from.
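A minimal numerical sketch of the identity above (illustrative only; the Gaussians p and q are arbitrary choices for the example, not anything from the papers discussed here):

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), used for both p and q below.
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                                  # quantity whose expectation we want under p
x = rng.normal(loc=0.0, scale=2.0, size=100_000)      # samples from the proposal q = N(0, 2^2)
w = gauss_pdf(x, 3.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance weights p(x)/q(x), with p = N(3, 1)

is_estimate = np.mean(f(x) * w)                       # estimates E_p[f(x)]
print(is_estimate)                                    # ≈ 3^2 + 1 = 10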

24 of 48

Variance Reduction:

  • One of the challenges with importance sampling is high variance.
  • Techniques exist to reduce this variance, especially when dealing with finite action distributions.

25 of 48

Constructing Guiding Distributions

  • An effective guiding distribution covers high-reward regions while avoiding large (ψ) weights. The KL-divergence is introduced with respect to high-reward states.

26 of 48

Adaptive Guiding Distributions

The distribution in the preceding section captures high-reward regions, but does not consider the current policy. It can be adapted to the policy by sampling from an I-projection of p(ξ) ∝ exp(r(ξ)/η), which is the optimal distribution for estimating E_p(ξ)[r(ξ)].

In practice this is done by augmenting the DDP reward with the log-probability of the current policy:

r̃(x, u) = r(x, u) + log π_θ(u | x)

The resulting distribution is then an approximate I-projection of

p(ξ) ∝ exp(r(ξ))

27 of 48

Incorporating Guiding Samples

  • Guiding samples assist policy search by incorporating DDP solutions.
  • These solutions can be initialized from human demonstrations or from an off-the-shelf search algorithm.
    • Guiding samples are adaptive when used.
    • Each iteration of the policy search deviates from previous DDP solutions.

28 of 48

29 of 48

Experiments

30 of 48

Trust Region Algorithms

31 of 48

Trust Region Policy Optimization (TRPO)

  • Shortcomings of Traditional Policy Gradient Methods:
    • Risk of large policy updates: Leads to destabilization and potential divergence.
    • Inconsistency: The same action can have very different implications based on how much it deviates from the previous policy.
  • Trust Region:
    • Ensures updates to the policy are bounded, so they don't deviate too much.
    • Uses the natural policy gradient and constrains the step size in policy space to ensure safe updates.
  • Why TRPO?
    • Stability: By avoiding drastic changes, TRPO avoids the pitfalls of big leaps that might collapse performance.
    • Efficiency: Converges to a solution faster than traditional policy gradient methods by ensuring each step is productive.
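For reference, the trust-region update of Schulman et al. (2015), added here because the slide's own formula was not recoverable from the extraction:

maximize_θ   E_t [ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ]
subject to   E_t [ KL( π_θ_old(·|s_t) ‖ π_θ(·|s_t) ) ] ≤ δ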

32 of 48

TRPO

Advantage Function
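The definition referenced here (reconstructed; the slide's equation did not extract):

A^π(s, a) = Q^π(s, a) − V^π(s)

i.e., how much better taking action a in state s is than the policy's average behaviour from s.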

33 of 48

TRPO

34 of 48

Proximal Policy Optimization (PPO)

  • TRPO's Cons:
    • Computationally expensive due to second-order optimization.
    • Implementation complexity: Requires careful tuning and is hard to implement in practice.
  • PPO's Simplification:
    • Introduces a "clipped" objective function (written out below), ensuring the policy doesn't change too drastically.
    • Avoids the complicated adaptive step size and the use of higher-order derivatives.
  • Why PPO?
    • Provides many of TRPO's benefits but is simpler and more scalable.
    • Due to its efficacy and simplicity, it's become a standard in many deep RL applications.
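The clipped objective from Schulman et al. (2017), written out here because the slide's formula did not extract:

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t ,  clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],   where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)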

35 of 48

TRPO vs PPO

TRPO

PPO

36 of 48

PPO algorithm
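The algorithm box did not survive extraction; below is a minimal, hedged PyTorch sketch of the core clipped-surrogate loss only (illustrative names; the value-function and entropy terms of the full algorithm are omitted):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # L^CLIP = -E[min(r·A, clip(r, 1-eps, 1+eps)·A)], negated because we minimize.
    ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for a collected batch
new_lp = torch.randn(64, requires_grad=True)
old_lp, adv = torch.randn(64), torch.randn(64)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()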

37 of 48

38 of 48

GI-PPO

  • Combines PPO sampling with analytical gradients.
  • Motivated by: high variance, bias, cost of LR+RP, imbalance of sampling vs gradients, handling biased gradients.
  • Introduces α-policy to adaptively integrate sampling and gradients.

39 of 48

Types of gradients

Analytical gradients:

  • Derived from differentiable environmental dynamics.
  • Directly represent action-outcome relationships; utilized in differentiable programming tools.

Likelihood ratio (LR) gradient:

  • Uses the log-derivative. Unbiased but can have high variance.
  • Does not require knowledge of environment dynamics.

Reparameterization (RP) gradient:

  • Uses the reparameterization trick and requires analytical gradients of the dynamics.
  • Typically lower variance than LR gradients, but the variance can explode in chaotic or long-horizon dynamics.
  • Subject to potential empirical bias.
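A minimal sketch (not from the paper) contrasting the two estimators of ∇_μ E_{a~N(μ,1)}[f(a)] on a toy differentiable objective f; all names are illustrative:

import torch

f = lambda a: (a - 2.0) ** 2                     # toy differentiable objective standing in for the environment
mu = torch.tensor(0.0, requires_grad=True)
eps = torch.randn(100_000)
a = mu + eps                                     # reparameterization: a = g_mu(eps), sigma fixed to 1

# RP gradient: differentiate through f(a) directly (needs df/da).
rp_grad = torch.autograd.grad(f(a).mean(), mu, retain_graph=True)[0]

# LR (score-function) gradient: E[f(a) · ∇_mu log N(a; mu, 1)], needs no df/da.
log_prob = -0.5 * (a.detach() - mu) ** 2         # log-density up to constants
lr_grad = torch.autograd.grad((f(a.detach()) * log_prob).mean(), mu)[0]

print(rp_grad.item(), lr_grad.item())            # both ≈ d/dμ E[(a-2)^2] = 2(μ-2) = -4; the LR estimate is noisier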

40 of 48

Limitations of Prior Work

  • LR gradients: high variance, control variates introduce bias.
  • RP gradients: can still explode, empirical bias.
  • LR+RP: interpolating requires estimating both, costly.
  • PPO: unclear how to leverage RP gradients.
  • Gradient-only: no stability of sampling. Sampling-only: no gradient info.

41 of 48

Sampling in GI-PPO

Differentiable environment provides analytical gradients for observation and reward w.r.t. previous action:
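Presumably (the expressions did not extract) these are the Jacobians of the current observation and reward with respect to the previous action:

∂o_t / ∂a_{t−1}   and   ∂r_t / ∂a_{t−1}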

Used to compute gradient of advantage function.

Policy Update

RP Gradient:

  • Defines πθ through a bijection gθ, mapping noise to action.
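In symbols (an assumed standard form; the slide's equation is missing), the action is generated as

a_t = g_θ(ε_t; s_t),   ε_t ~ N(0, I)

so gradients can flow from the differentiable reward back to θ through a_t.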

42 of 48

Adaptive α

  • The α-policy bridges the gap between PPO’s original policy and the target policy for RP gradient.
  • Adjusting α allows us to strike a balance between exploiting analytical gradients and following PPO’s policy update rules.

Adaptive Adjustments of α:

1. Variance: Adjust α to ensure stable policy updates, considering eigenvalues of the gradient.

2. Bias: Decrease α if analytical gradients seem biased.

3. Out-of-range-ratio: Ensure PPO’s restrictions are upheld.

PPO Update with α:

43 of 48

44 of 48

Takeaways

  • Introduced analytical gradients into the PPO framework for enhanced policy learning.
  • The α-policy adapts the current policy, enabling estimation of the variance and bias of the analytical gradients.
  • Developed adaptive α-algorithm to adjust the influence of analytical gradients.
  • Outperformed baseline PPO in diverse RL environments.
  • Demonstrated superiority over methods like interpolated gradient due to stable PPO learning.

45 of 48

Limitations & Future Work

  • Approximation Challenge: Achieving true α-policy approximation remains elusive even with large α values.
  • PPO Dependence: GI-PPO's close ties with PPO restrict the full potential of analytical gradients.
  • Computational Overhead: Use of backpropagation for analytical gradients demands increased training time.
  • Fixed PPO Boundaries: PPO's set clipping range might compromise the convergence rate in certain scenarios.

46 of 48

Summary

  • Overview of Reinforcement Learning
  • Value Learning vs Policy Learning
  • Sample Efficiency in RL
  • Exploitation-Exploration Dilemma
  • Guiding the policy search
  • Trust Region Algorithms
  • Gradient Informed PPO

47 of 48

References

Levine, S. and Koltun, V., 2013, May. Guided policy search. In International conference on machine learning (pp. 1-9). PMLR.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., 2015, June. Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Son, S., Zheng, L.Y., Sullivan, R., Qiao, Y.L. and Lin, M.C., 2023. Gradient informed proximal policy optimization. In Advances in Neural Information Processing Systems.

48 of 48

Questions