1 of 29

Efficient and Generalized Deep Reinforcement Learning

  • 35 mins basic DRL introduction
  • 15 mins Q&A

Sihao Wu

Date: 04/04/2022

2 of 29

Challenges

~15,000,000 labeled samples

  • Supervised learning (e.g. classification) has access to tens of millions of labeled samples

  • Reinforcement learning has to explore the environment to collect its own data, which is time-consuming

  • A slight change of the environment can severely degrade RL agent performance

3 of 29

Introduction

  • RL is usually trained and tested in the same environment, which does not match real-world scenarios

  • Improving generalization requires specifying the target environments

  • Sim-to-real transfer is the corresponding setting for robotics

4 of 29

Methodology – Data Augmentation

https://arxiv.org/abs/2004.14990

Data augmentation can be used to enforce the learning of invariant representations.

Well-timed data augmentation could help even more!
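As a concrete illustration, here is a minimal sketch of a pad-and-crop random shift, one of the augmentations used in this line of work; the batch shape, pad size and padding mode are illustrative assumptions, not taken from the paper's code.

import numpy as np

def random_crop(obs_batch: np.ndarray, pad: int = 4) -> np.ndarray:
    # Pad each image observation at the borders, then crop back to the original
    # size at a random offset, so the image content is randomly shifted.
    b, c, h, w = obs_batch.shape
    padded = np.pad(obs_batch, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(obs_batch)
    for i in range(b):
        top = np.random.randint(0, 2 * pad + 1)
        left = np.random.randint(0, 2 * pad + 1)
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Usage: augment a batch of (C, H, W) pixel observations before each update.
obs = np.random.rand(32, 3, 84, 84).astype(np.float32)
aug_obs = random_crop(obs)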

5 of 29

First step – state curiosity (uncertainty)

How to choose the augmentation strength under different levels of curiosity:

  1. Larger curiosity, strong augmentation? Lower curiosity, weak augmentation?

  2. Lower curiosity, strong augmentation? Larger curiosity, weak augmentation?

  3. How to balance strong and weak augmentation?
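A hypothetical sketch of option 1 above (stronger augmentation when curiosity is high). The curiosity score, the threshold and the two augmentations are all illustrative assumptions for discussion, not a settled design.

import numpy as np

def weak_augment(obs: np.ndarray) -> np.ndarray:
    # small additive noise only
    return obs + 0.01 * np.random.randn(*obs.shape).astype(obs.dtype)

def strong_augment(obs: np.ndarray) -> np.ndarray:
    # larger noise plus a random channel-wise intensity change ("color jitter")
    jitter = np.random.uniform(0.8, 1.2, size=(obs.shape[0], 1, 1)).astype(obs.dtype)
    return obs * jitter + 0.05 * np.random.randn(*obs.shape).astype(obs.dtype)

def choose_augmentation(obs: np.ndarray, curiosity: float, threshold: float = 0.5) -> np.ndarray:
    # high curiosity (uncertainty) -> strong augmentation, low curiosity -> weak augmentation
    return strong_augment(obs) if curiosity > threshold else weak_augment(obs)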

6 of 29

Topic Arrangement - Generalization and Uncertainty

- Last week: we talked about generalization and efficient RL

- This week: let's introduce reinforcement learning first, to lay the foundation for collaboration

- Later this week: Gaojie will give an introduction to DNN generalization

- Next week: Kevin will introduce deep learning uncertainty, based mainly on the paper [What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?]

Please pay attention to the Teams meeting arrangement email if you want to join us

7 of 29

Reinforcement Learning

  • The environment the DRL agent interacts with is formalized as a Markov Decision Process (MDP).

MDP:

S is the state space

A is the action space

P : S×A×S → [0, 1] is the transition probability

r : S×A → R is the reward function

γ ∈ [0, 1) is the discount factor

Cumulative reward: R = r_1 + r_2 + ... + r_T

Discounted cumulative reward: G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ_k γ^k · r_{t+k+1}

The agent aims to maximize the expected discounted cumulative reward.
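A minimal sketch of the two return definitions, assuming rewards are given as a plain Python list.

def cumulative_reward(rewards):
    # plain sum of all rewards in the episode
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # backward recursion G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99^2 ≈ 2.9701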

8 of 29

Reinforcement Learning – Agent

There are many different branches of DRL*, which can basically be divided into:

  • Critic Network (value-based RL) [Q network]
  • Actor Network (policy-based RL)
  • Actor-Critic Network

*https://datawhalechina.github.io/easy-rl/#/

Optimize the Q network (critic):

Q_estimate -> Q_target (based on the Bellman equation)

Q_target ≈ r + γQ'

Loss = MSE(Q_estimate, Q_target)

Optimize the actor network:

Maximize Q_estimate

Loss = -Q_estimate
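A minimal PyTorch-style sketch of the two optimization steps above; the network interfaces, the replay batch layout and the hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F

def critic_loss(q_net, target_q_net, actor, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = actor(s_next)
        # Q_target ≈ r + γQ' (zero bootstrap at terminal states)
        q_target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    q_estimate = q_net(s, a)
    return F.mse_loss(q_estimate, q_target)          # Loss = MSE(Q_estimate, Q_target)

def actor_loss(q_net, actor, batch):
    s = batch[0]
    return -q_net(s, actor(s)).mean()                # Loss = -Q_estimate (maximize Q)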


9 of 29

Reinforcement Learning – Policy update principle

10 of 29

Reinforcement Learning – Q network (critic)

[Figure: Q network (critic) and actor network]

11 of 29

Reinforcement Learning – Q network (critic)

  • Update the Q value of the current state-action pair using the immediate reward plus the discounted Q value of the next state-action pair

  • To ensure exploration, apply an epsilon-greedy policy

  • The target Q network is updated as an exponential moving average of the online Q network
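A minimal sketch of the epsilon-greedy rule and the exponential-moving-average target update described above; epsilon, tau and the network objects are illustrative assumptions.

import random
import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float = 0.1) -> int:
    # q_values: 1-D tensor of Q values for every action in the current state
    if random.random() < epsilon:
        return random.randrange(q_values.shape[-1])   # explore: random action
    return int(q_values.argmax(dim=-1))               # exploit: greedy action

@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.005):
    # target <- tau * online + (1 - tau) * target (exponential moving average)
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)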

12 of 29

Reinforcement Learning – Actor network

Policy gradient (on-policy)

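A minimal sketch of the vanilla on-policy policy-gradient (REINFORCE) loss, assuming the actor returns a torch.distributions object and that full episode returns are used as the weight (rather than an advantage estimate).

import torch

def policy_gradient_loss(actor, states, actions, returns):
    # actor(states) is assumed to return a torch.distributions.Distribution
    dist = actor(states)
    log_probs = dist.log_prob(actions)
    # gradient ascent on E[ log π(a|s) · G ], so minimize the negative
    return -(log_probs * returns).mean()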

13 of 29

Reinforcement Learning – Advanced DRL algorithms

Actor-Critic Networks

  • PPO (Proximal Policy Optimization)

  • DDPG (Deep Deterministic Policy Gradient)

  • SAC (Soft Actor Critic) [seems SOTA]


14 of 29

Reinforcement Learning – PPO

- Reuses the collected data for multiple update epochs, which is more sample-efficient
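A minimal sketch of PPO's clipped surrogate objective, which is what makes reusing the collected data for several epochs safe; the advantage estimates and the clip range are illustrative assumptions.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)         # π_new(a|s) / π_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # pessimistic (minimum) of the unclipped and clipped objectives, negated for minimization
    return -torch.min(ratio * advantages, clipped * advantages).mean()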

15 of 29

Reinforcement Learning – PPO

16 of 29

Reinforcement Learning – PPO


17 of 29

Reinforcement Learning – DDPG

- Goal: from a discrete action space -> a continuous action space

Discrete actions vs. continuous actions:

  • Discrete: the policy outputs a probability for each action (e.g. Up / Stop / Down)

  • Continuous: the policy outputs an action value directly, scaled to a range such as [-2, 2] (like a vehicle speed)
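A minimal sketch of this contrast: a discrete policy head outputs a probability per action, while a DDPG-style continuous head outputs an action value directly, squashed to a range such as [-2, 2]. Layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    def __init__(self, obs_dim, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs):
        # probabilities over a discrete action set, e.g. {Up, Stop, Down}
        return torch.softmax(self.net(obs), dim=-1)

class ContinuousPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim=1, act_limit=2.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.act_limit = act_limit

    def forward(self, obs):
        # action value squashed by tanh and scaled to [-act_limit, act_limit]
        return self.act_limit * torch.tanh(self.net(obs))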

18 of 29

Reinforcement Learning – DDPG

Tricks:

  • The actor outputs a deterministic action (exploration noise is added when interacting with the environment)

  • The policy network is updated at every step

19 of 29

Reinforcement Learning – DDPG

Target network + ReplayMemory


Update the Q network (w): Loss = MSE(Q_estimate, Q_target), with Q_target = r + γ·Q'(s', a') computed from the target networks and (s, a, r, s') sampled from the replay memory.
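A minimal sketch of a replay memory in the sense used here, assuming transitions are stored as (s, a, r, s', done) tuples and sampled uniformly at random; the capacity is an illustrative choice.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        # old transitions are discarded automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of consecutive transitions
        idx = random.sample(range(len(self.buffer)), batch_size)
        states, actions, rewards, next_states, dones = zip(*(self.buffer[i] for i in idx))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)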

 

20 of 29

Reinforcement Learning – SAC

Motivation:

  • TRPO, PPO (on-policy) -> sample-inefficient

  • DDPG, D4PG (not open-source) -> off-policy, but deterministic policy

  • Soft Q-Learning, Soft Actor-Critic (SAC) -> off-policy, stochastic policy, open-source, arguably the strongest algorithm at the moment

21 of 29

Reinforcement Learning – SAC

Why Maximum Entropy Reinforcement Learning?

Advantages compared with DDPG:

  • Key idea: make use of every valuable action

  • DDPG: considers only one optimal action for state s_t

  • SAC: considers several near-optimal actions (the entropy term spreads the action distribution over all valuable actions)
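For reference, the maximum entropy RL objective behind SAC augments the expected return with a policy-entropy bonus weighted by a temperature α (standard form from the SAC paper):

J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π} [ r(s_t, a_t) + α·H(π(·|s_t)) ]

The entropy bonus H(π(·|s_t)) is what keeps the action distribution spread over all valuable actions rather than collapsing onto a single one.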

22 of 29

Reinforcement Learning – SAC

Three key components in SAC:

  • An actor-critic architecture with separate policy and value function networks;

  • An off-policy formulation that enables reuse of previously collected data for efficiency;

  • Entropy maximization to enable stability and exploration.

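A minimal PyTorch-style sketch of how these three components fit together in one SAC update: a soft (entropy-regularized) critic target and a stochastic actor trained off-policy from replayed data. A single critic, a fixed temperature alpha, and an actor that returns a reparameterizable torch.distributions object with a total log-probability are illustrative simplifications (SAC normally uses two critics and can learn alpha).

import torch
import torch.nn.functional as F

def sac_losses(q_net, q_target, actor, batch, gamma=0.99, alpha=0.2):
    s, a, r, s_next, done = batch

    # critic: soft Bellman target r + γ [ Q'(s', a') − α log π(a'|s') ]
    with torch.no_grad():
        dist_next = actor(s_next)
        a_next = dist_next.rsample()
        target = r + gamma * (1.0 - done) * (q_target(s_next, a_next)
                                             - alpha * dist_next.log_prob(a_next))
    critic_loss = F.mse_loss(q_net(s, a), target)

    # actor: minimize E[ α log π(a|s) − Q(s, a) ], i.e. maximize the soft value
    dist = actor(s)
    a_new = dist.rsample()
    actor_loss = (alpha * dist.log_prob(a_new) - q_net(s, a_new)).mean()

    return critic_loss, actor_loss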

23 of 29

Reinforcement Learning – SAC

24 of 29

Reinforcement Learning – generalization and uncertainty


  • How do we get an RL agent that generalizes well?

  • Data augmentation, representation learning, active pre-training, domain randomization (sim2real)

  • Generalization will be a very important aspect of RL

  • Safe RL* (multi-objective)

  • Uncertainty techniques and representation learning will help improve generalization and efficiency

  • Applications: planning + control of robots / multi-robot systems, end-to-end autonomous driving (CARLA competition)

*https://hal.inria.fr/tel-03035705/document

25 of 29

Reinforcement Learning – Literature Review

Different classical RL algorithms

26 of 29

sim2real

Is deep RL data-efficient, and does it generalize well?

We foresee a future in which data-efficient and generalizable reinforcement learning will boost the development of robotic control and autonomous vehicles in the real world.

https://arxiv.org/abs/1811.05939

27 of 29

First step

How to choose the augmentation strength under different levels of curiosity:

  1. Larger curiosity, strong augmentation? Lower curiosity, weak augmentation?

  2. Lower curiosity, strong augmentation? Larger curiosity, weak augmentation?

  3. How to balance strong and weak augmentation?

28 of 29

Research Goal (final/future goal)

How to build the training data distribution:

- Use prior knowledge to define the data-distribution transfer approach (randomization / data augmentation) for sim2real

- Bayesian prior knowledge (bayelime -> a. use it to explain sequential decision-making, b. use XAI to augment RL)

- Pixel-based end-to-end RL for the CARLA autonomous vehicle, which especially struggles with data efficiency and generalization (a bird's-eye view may be a good direction)

- No paper yet explores how to define generalization, safety or sim2real in the CARLA competition (current work focuses on increasing performance, ~94% completion)

- Generalization through algorithm-level optimization

- How to define uncertainty in RL? How does uncertainty help? How to evaluate uncertainty during training?

https://leaderboard.carla.org/

29 of 29

Thank you for your attention!

Q & A