Enhancing Sampling Efficiency in RL
Mohamed Elnoor
Department of Electrical and Computer Engineering
University of Maryland
11/2/2023
Introduction
Contents
Machine Learning Methods
https://www.mathworks.com/discovery/reinforcement-learning.html
Why Reinforcement Learning?
Reinforcement Learning Overview
Reinforcement Learning Overview: RL as MDP
Markov Decision Process (MDP): An MDP defines the environment in which an RL agent operates, characterized by a set of states S, a set of actions A, transition probabilities P(s′ | s, a), a reward function R(s, a), and a discount factor γ ∈ [0, 1).
The agent aims to determine the optimal policy π∗ that maximizes the expected cumulative discounted reward:
π∗ = argmax_π E[ Σ_t γ^t R(s_t, a_t) ]
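To make this concrete, here is a minimal Python sketch of the agent-environment interaction an MDP describes; env and policy are hypothetical stand-ins with the usual reset()/step() interface.

# Agent-environment loop for an MDP (env and policy are hypothetical objects).
def run_episode(env, policy, gamma=0.99):
    state = env.reset()
    total, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                      # a_t ~ pi(. | s_t)
        state, reward, done = env.step(action)      # s_{t+1} ~ P(. | s_t, a_t), r_t = R(s_t, a_t)
        total += discount * reward                  # accumulate gamma^t * r_t
        discount *= gamma
    return total                                    # one sample of the discounted return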
Value Learning vs. Policy Learning
Value Learning: find Q(s, a); act with a = argmax_a Q(s, a)
Policy Learning: find π(s) directly; act by sampling a ~ π(s)
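A small Python sketch of the two action-selection rules; q_table and policy_probs are hypothetical tabular stand-ins for whatever function approximator is actually learned.

import numpy as np

# Value learning: act greedily with respect to a learned action-value table.
# q_table is assumed to have shape [n_states, n_actions].
def act_value_based(q_table, state):
    return int(np.argmax(q_table[state]))                                    # a = argmax_a Q(s, a)

# Policy learning: sample an action from the learned policy's probabilities.
# policy_probs is assumed to have shape [n_states, n_actions], rows summing to 1.
def act_policy_based(policy_probs, state, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(policy_probs[state]), p=policy_probs[state]))  # a ~ pi(. | s)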
Value Learning
Wang, Y., Wang, L., & Dong, X. (2021). An intelligent TCP congestion control method based on deep Q network. Future Internet, 13(10), 261.
Policy Learning
http://introtodeeplearning.com/
Policy Learning
loss = −log P(a_t | s_t) · R_t
w′ = w − ∇_w loss
w′ = w + ∇_w log P(a_t | s_t) · R_t
http://introtodeeplearning.com/
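The same update in a short PyTorch sketch; policy_net (mapping states to action logits), optimizer, and the collected states/actions/returns are assumed to already exist.

import torch

# One REINFORCE-style update: minimize -log pi(a_t | s_t) * R_t,
# which is gradient ascent on log pi(a_t | s_t) * R_t as in the equations above.
def reinforce_update(policy_net, optimizer, states, actions, returns):
    log_probs = torch.log_softmax(policy_net(states), dim=-1)       # log pi(. | s_t), shape [T, n_actions]
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a_t | s_t); actions is a LongTensor [T]
    loss = -(chosen * returns).mean()                               # loss = -log pi(a_t | s_t) * R_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()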
Use of Gradient in Policy Learning
Sample Efficiency in Policy Learning
Motivation:
Why It Matters:
Exploration-Exploitation Dilemma
The Dilemma:
Risks of Pure Exploitation:
Guided Policy Search
Policy Optimization: A Refresher
Policy Gradient: Optimize E[J(π)]
▶ Parameters: θ
▶ Gradient Estimate:
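The gradient estimate here is the standard likelihood-ratio (REINFORCE) estimator; writing R(τ) for the return of a trajectory τ sampled from π_θ,

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]
  \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)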
Concept and Usage of Importance Sampling
Importance Sampling: estimate an expectation under a target distribution p using samples drawn from a different proposal distribution q, reweighting each sample by the ratio p(x)/q(x).
Usage: trajectories sampled from the guiding distributions are reweighted so they can be used to estimate (and improve) the expected return of the current policy, without having to sample from that policy directly.
Variance Reduction: a proposal that covers the high-reward regions keeps the importance weights well behaved; regularizing the estimator further limits the variance.
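A small NumPy sketch of a self-normalized importance-sampling estimate of E_p[f(x)] from samples drawn from q; f, p_logpdf, and q_logpdf are hypothetical vectorized callables.

import numpy as np

# Self-normalized importance sampling: estimate E_p[f(x)] using samples x_i ~ q.
def importance_sampling_estimate(f, samples, p_logpdf, q_logpdf):
    log_w = p_logpdf(samples) - q_logpdf(samples)   # log importance weights log p(x)/q(x)
    w = np.exp(log_w - np.max(log_w))               # subtract max for numerical stability
    w /= w.sum()                                    # self-normalize the weights
    return float(np.sum(w * f(samples)))            # weighted estimate of E_p[f(x)]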
Constructing Guiding Distributions
Adaptive Guiding Distributions
The distribution in the preceding section captures high-reward regions, but does not consider the current policy. It can be adapted to the policy by sampling from an I-projection of p(ξ) ∝ exp(r(ξ)/η), which is the optimal distribution for estimating E_p(ξ)[r(ξ)].
r̃(x, u) = r(x, u) + log π(u | x), i.e., the reward is augmented with the log-probability of the current policy.
The resulting distribution is then an approximate I-projection of p(ξ) ∝ exp(r(ξ)).
Incorporating Guiding Samples
Experiments
Trust Region Algorithms
Trust Region Policy Optimization (TRPO)
TRPO
Advantage Function
A(s, a) = Q(s, a) − V(s): how much better taking action a in state s is than the policy's average behavior; TRPO and PPO use advantage estimates Â_t in their surrogate objectives.
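In practice Â_t is computed from sampled rewards and a learned value function; a minimal sketch using generalized advantage estimation (GAE), assuming a single rollout segment with a bootstrap entry appended to values:

import numpy as np

# Generalized Advantage Estimation over one segment: values has length T + 1
# (the last entry is the bootstrap value for the state after the segment).
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running                  # exponentially weighted sum of residuals
        adv[t] = running
    return adv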
TRPO
Proximal Policy Optimization (PPO)
TRPO vs PPO
TRPO: maximizes the surrogate objective subject to a hard KL-divergence constraint between the old and new policies; requires second-order (conjugate-gradient) machinery.
PPO: replaces the constraint with a clipped surrogate objective (or a KL penalty), so a plain first-order optimizer suffices and the method is much simpler to implement.
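For reference, with probability ratio r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) and advantage estimate Â_t, the two surrogate problems are:

\text{TRPO:}\quad \max_\theta \; \mathbb{E}_t\big[\, r_t(\theta)\, \hat{A}_t \,\big]
\quad\text{s.t.}\quad
\mathbb{E}_t\big[\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \,\big] \le \delta

\text{PPO (clipped):}\quad \max_\theta \; \mathbb{E}_t\Big[\, \min\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \,\Big]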
PPO algorithm
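A minimal PyTorch sketch of the clipped PPO policy loss; log_probs_new, log_probs_old, and advantages are assumed to be precomputed tensors for a minibatch.

import torch

# Clipped PPO surrogate loss (returned as a quantity to minimize).
def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                          # r_t(theta) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                              # negate to maximize the surrogate

The default clip_eps = 0.2 is the value suggested in the PPO paper (Schulman et al., 2017).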
GI-PPO
Types of gradients
Analytical gradients: obtained by differentiating directly through a differentiable environment or simulator (e.g., rewards and next observations with respect to actions).
Likelihood ratio (LR) gradient: the score-function estimator behind REINFORCE and PPO; requires only sampled log-probabilities and returns, is unbiased, but often has high variance.
Reparameterization (RP) gradient: writes the sampled action as a deterministic function of the parameters and an independent noise variable and differentiates through the sample; typically lower variance, but requires the objective to be differentiable along that path.
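In symbols, for an objective E_{x∼p_θ}[f(x)] the two stochastic estimators are:

\text{LR:}\quad
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
  = \mathbb{E}_{x \sim p_\theta}\big[\, f(x)\, \nabla_\theta \log p_\theta(x) \,\big]

\text{RP (with } x = g_\theta(\epsilon),\ \epsilon \sim q\text{):}\quad
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
  = \mathbb{E}_{\epsilon \sim q}\big[\, \nabla_\theta f\big(g_\theta(\epsilon)\big) \,\big]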
Limitations of Prior Work
Sampling in GI-PPO
A differentiable environment provides analytical gradients of the observation and the reward with respect to the previous action; these are used to compute the gradient of the advantage function.
Policy Update
RP Gradient:
Adaptive α
Adaptive Adjustments of α:
1. Variance: Adjust α to ensure stable policy updates, considering eigenvalues of the gradient.
2. Bias: Decrease α if analytical gradients seem biased.
3. Out-of-range-ratio: Ensure PPO’s restrictions are upheld.
PPO Update with α:
Takeaways
Limitations & Future Work
Summary
References
Levine, S. and Koltun, V., 2013, May. Guided policy search. In International conference on machine learning (pp. 1-9). PMLR.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., 2015, June. Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Son, S., Zheng, L.Y., Sullivan, R., Qiao, Y.L. and Lin, M.C., 2023. Gradient informed proximal policy optimization. Advances in Neural Information Processing Systems.
Questions