Efficient and Generalized Deep Reinforcement Learning
Sihao Wu
Date: 04/04/2022
Challenges
15,000,000 labeled data samples
Introduction
Methodology – Data Augmentation
https://arxiv.org/abs/2004.14990
Data augmentation can be used to
enforce the learning of invariant representations
Timely data augmentation could help more!
First step – state curiosity (uncertainty)
How to choose augmentation under different levels of curiosity
Topic Arrangement - Generalization and Uncertainty
- Last week, we talked about generalization and efficient RL
- This week: let's introduce Reinforcement Learning first, to lay the foundation for collaboration
- Later this week: Gaojie will give an introduction to DNN generalization
- Next week: Kevin will introduce deep learning uncertainty, mainly based on the paper [What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?]
Please pay attention to the TEAMS meeting arrangement email if you want to join us
Reinforcement Learning
MDP:
S is the state space
A is the action space
P : S×A×S → [0, 1] is the transition probability
r (s, a) ∈ R is the reward function
γ ∈ [0, 1) is the discount factor
Cumulative reward: the sum of rewards collected along a trajectory
Discounted cumulative reward: weight each future reward by the discount factor γ
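In the MDP notation above, a standard form of the discounted return from step t is:
G_t = r(s_t, a_t) + γ·r(s_{t+1}, a_{t+1}) + γ²·r(s_{t+2}, a_{t+2}) + … = Σ_{k≥0} γ^k · r(s_{t+k}, a_{t+k})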
Reinforcement Learning – Agent
Many different branches of DRL*
which can basically be divided into:
*https://datawhalechina.github.io/easy-rl/#/
Q network
Actor network
Optimize Q network:
Q_estimate -> Q_target
(based on the Bellman equation)
Q_target ≈ r + γQ’
Loss = MSE(Q_estimate, Q_target)
Optimize actor network:
Maximize Q_estimate
Loss = -Q_estimate
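As a minimal sketch of these two updates, assuming PyTorch; q_net and actor_net are hypothetical modules, and state, action, reward, next_state, gamma are placeholders for a batch of collected experience:

import torch
import torch.nn.functional as F

# Critic: regress Q_estimate toward the Bellman target r + γ·Q'
q_estimate = q_net(state, action)
with torch.no_grad():
    q_next = q_net(next_state, actor_net(next_state))   # later slides (DDPG/SAC) replace this with a target network
q_target = reward + gamma * q_next
critic_loss = F.mse_loss(q_estimate, q_target)

# Actor: maximize Q_estimate, i.e. minimize -Q_estimate
actor_loss = -q_net(state, actor_net(state)).mean()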
Agent
Reinforcement Learning – Policy update principle
Reinforcement Learning – Q network (critic)
Q network
Actor network
Reinforcement Learning – Q network (critic)
Reinforcement Learning – Actor network
Policy Gradient (on policy)
Q network
Actor network
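A minimal on-policy policy-gradient (REINFORCE-style) loss sketch, assuming PyTorch; policy_net, states, actions, and returns are placeholders for the collected rollout:

import torch

# policy_net outputs action logits for each state in the rollout
dist = torch.distributions.Categorical(logits=policy_net(states))
log_probs = dist.log_prob(actions)
# returns: discounted cumulative rewards computed from the same rollout
policy_loss = -(log_probs * returns).mean()

The gradient raises the log-probability of actions in proportion to the return they led to, which is why the data must come from the current (on-policy) policy.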
Reinforcement Leaning – Advanced DRL algorithm
Actor-Critic Networks
Q network
Actor network
Reinforcement Learning – PPO
- Reuses the collected data: more sample-efficient (see the clipped-ratio sketch below)
Reinforcement Learning – PPO
Reinforcement Learning – PPO
Q network
Actor network
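Reusing old rollouts relies on clipping the importance ratio between the new and old policy; a minimal sketch of the PPO clipped surrogate loss, assuming PyTorch and precomputed new_log_probs, old_log_probs, advantages, and a clip range eps_clip:

import torch

ratio = torch.exp(new_log_probs - old_log_probs)          # π_new(a|s) / π_old(a|s)
clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip)  # keep the update near the data-collecting policy
ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()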
Reinforcement Learning – DDPG
- Goal: from discrete action space -> continuous action space
Discrete action vs. continuous action
Output probability of action
Output value of action
Scale to [-2,2], like vehicle speed
Up
Stop
Down
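A sketch of how a deterministic actor head can emit a continuous action scaled to [-2, 2], assuming PyTorch; state_dim, action_dim, and the layer sizes are illustrative:

import torch.nn as nn

actor_net = nn.Sequential(
    nn.Linear(state_dim, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
    nn.Tanh(),                     # squashes the output to [-1, 1]
)
action = 2.0 * actor_net(state)    # rescale to [-2, 2], e.g. a vehicle speed command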
Reinforcement Learning – DDPG
Tricks
Output deterministic action
Policy network is updated at each step
Reinforcement Learning – DDPG
Target network + ReplayMemory
Q network
Actor network
Update Q network (w)
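A minimal sketch of both tricks, assuming PyTorch, a replay_memory list of stored transitions, and a soft-update (Polyak) coefficient tau:

import random
import torch

# Replay memory: sample a random minibatch to decorrelate transitions
batch = random.sample(replay_memory, batch_size)

# Target network: slowly track the online Q network to stabilize Q_target
with torch.no_grad():
    for w, w_target in zip(q_net.parameters(), target_q_net.parameters()):
        w_target.mul_(1.0 - tau).add_(tau * w)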
Reinforcement Learning – SAC
Stochastic policy, open-source, widely regarded as one of the strongest algorithms at present
Motivation:
Why Maximum Entropy Reinforcement Learning?
Advantages compared with DDPG:
SAC
Key idea: utilize every valuable action
DDPG: considers only one optimal action for state s_t
SAC: considers several promising actions, since the entropy term spreads probability mass over the action distribution (see the objective below)
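In standard form, the maximum-entropy objective that SAC optimizes adds a policy-entropy bonus to the return, weighted by a temperature α:
J(π) = Σ_t E[ r(s_t, a_t) + α·H(π(·|s_t)) ]
A larger α keeps the policy stochastic and spreads probability over several good actions; α → 0 recovers the usual RL objective.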
Reinforcement Learning – SAC
Reinforcement Learning – SAC
Three key components in SAC:
Q network
Actor network
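In the common formulation, the entropy bonus also enters the critic target: Q_target ≈ r + γ·( Q'(s_{t+1}, a_{t+1}) − α·log π(a_{t+1}|s_{t+1}) ), with a_{t+1} sampled from the current stochastic policy.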
Reinforcement Learning – SAC
Reinforcement Learning – generalization and uncertainty
MDP:
*https://hal.inria.fr/tel-03035705/document
Reinforcement Learning – Literature Review
Different classical RL algorithms
sim2real
Is deep RL data-efficient, and does it generalize well?
We foresee a future in which data-efficient and well-generalizing Reinforcement Learning will boost the development of robotics control and autonomous vehicles in the real world
https://arxiv.org/abs/1811.05939
First step
How to choose augmentation under different levels of curiosity
Research Goal (final/future goal)
How to build the training data distribution
- Use prior knowledge to define the data distribution transfer approach (randomization / data augmentation) for sim2real
- Bayesian prior knowledge (BayLIME -> a. use it to explain sequential decision-making, b. use XAI to augment RL)
- Pixel-based end-to-end RL for the CARLA autonomous vehicle, which particularly struggles with data efficiency and generalization (bird's-eye view may be a good direction)
- No paper yet explores how to define generalization, safety, or sim2real in the CARLA competition (current work focuses on increasing performance, ~94% completion)
- Algorithm-optimization-based generalization
- How to define uncertainty in RL? How does uncertainty help? How can uncertainty be evaluated during training?
https://leaderboard.carla.org/
Thank you for your attention!
Q & A