1 of 23

Lecture 8: Twin-Delayed DDPG (TD3)


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 642 - Advanced Reinforcement Learning

2 of 23


List of contents for this lecture

  • What is TD3?

  • Motivation behind TD3

  • Math behind TD3

  • The TD3 Algorithm

3 of 23


Relevant readings for this lecture

  • Chapter 12 of Miguel Morales, “Grokking Deep Reinforcement Learning”, Manning, 2020.

4 of 23


What is TD3?

  • TD3 (Twin-Delayed DDPG) is a model-free, off-policy, actor-critic algorithm designed for continuous action spaces.

  • It builds on Deep Deterministic Policy Gradient (DDPG).

  • TD3 improves stability and performance by addressing the overestimation bias and the high variance of value estimates that destabilize DDPG.

  • TD3 was developed by Fujimoto et al., 2018, “Addressing Function Approximation Error in Actor-Critic Methods”, https://arxiv.org/abs/1802.09477.

5 of 23


Motivation for TD3: Why is plain DDPG problematic?

6 of 23


TD3 introduces three tricks to fix the issues in DDPG

  • TD3 introduces three core improvements to DDPG:

    • Twin critics → reduce overestimation bias by taking the minimum of the two critics' estimates, which removes the optimistic bias

    • Target policy smoothing → make Q less sensitive to small changes in the action by adding clipped noise to the next action when computing the target

    • Delayed policy updates → update the actor (and the target networks) less frequently than the critics

  • Compared to DDPG, TD3 is more stable and more sample-efficient; the three improvements are written out as equations below.
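For concreteness, the first two improvements can be written out as in the TD3 paper (Fujimoto et al., 2018). The notation below (target actor \(\mu_{\phi'}\), target critics \(Q_{\theta_1'}, Q_{\theta_2'}\), smoothing-noise scale \(\tilde{\sigma}\), noise clip \(c\)) is standard but chosen here for exposition:

\[
\tilde{a} = \operatorname{clip}\big(\mu_{\phi'}(s') + \epsilon,\; a_{\min},\; a_{\max}\big),
\qquad
\epsilon \sim \operatorname{clip}\big(\mathcal{N}(0, \tilde{\sigma}),\, -c,\, c\big)
\]
\[
y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a})
\]

Both critics are then regressed toward the same target \(y\), while the actor and the three target networks are updated only every \(d\)-th critic update (the third improvement).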

7 of 23


TD3 architecture overview (1)

Fig. TD3 maintains six networks: Critic 1, Critic 2, the Actor, and their target copies (Target Critic 1, Target Critic 2, Target Actor).

8 of 23


TD3 architecture overview (2)

9 of 23


Key operations

Three key operations:

    • Twin critics + clipped double Q-learning

    • Target policy smoothing

    • Delayed policy and target updates

10 of 23


Twin critics + clipped double Q-learning

11 of 23


Target policy smoothing (1)


12 of 23


Target policy smoothing (2)

13 of 23


Clipped double Q-learning under target policy smoothing

Fig. The target actor's smoothed action is fed to Target Critic 1 and Target Critic 2; the minimum of their two estimates is then used to form the TD target.
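Since the diagram's details did not survive extraction, here is a minimal PyTorch-style sketch of the TD-target computation shown above, combining target policy smoothing with clipped double Q-learning. The module names (target_actor, target_critic_1, target_critic_2), the action bounds, and the hyperparameter values are illustrative assumptions, not code from the course.

    import torch

    def td3_target(reward, next_state, done, target_actor, target_critic_1, target_critic_2,
                   gamma=0.99, sigma=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
        with torch.no_grad():
            # Target policy smoothing: clipped Gaussian noise on the target action
            next_action = target_actor(next_state)
            noise = (sigma * torch.randn_like(next_action)).clamp(-noise_clip, noise_clip)
            smoothed_action = (next_action + noise).clamp(act_low, act_high)

            # Clipped double Q-learning: minimum of the two target critics
            q1 = target_critic_1(next_state, smoothed_action)
            q2 = target_critic_2(next_state, smoothed_action)
            q_min = torch.min(q1, q2)

            # TD target (no bootstrapping at terminal states)
            return reward + gamma * (1.0 - done) * q_min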

14 of 23


Delayed policy and target updates (1)

While target networks are used in DDPG as well, the authors of TD3 note that the interplay between the actor and critic updates in such settings can itself be a reason for failure. As stated in the paper, value estimates diverge through overestimation when the policy is poor, and the policy becomes poor if the value estimate itself is inaccurate.

15 of 23


Delayed policy and target updates (2)

Proposed solution: Update the two critic networks more frequently than the actor and the three target networks:

Fig. The two critic networks are more frequently updated than the actor and the three target networks.

As per the paper, the critics are updated at every step, while the actor and the targets are updated every second step (see the sketch after the figure).

[Diagram: Critic 1, Critic 2, Actor, Target Critic 1, Target Critic 2, Target Actor, arranged by update frequency.]
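A minimal PyTorch-style sketch of this delayed-update schedule follows; the network, optimizer, and loss objects are assumed to exist, and only the scheduling is the point. The defaults policy_delay = 2 and soft-update rate tau = 0.005 are those reported in the TD3 paper.

    import torch

    def delayed_td3_step(step, state, critic_loss,
                         actor, critic_1, critic_2,
                         target_actor, target_critic_1, target_critic_2,
                         actor_opt, critic_opt, policy_delay=2, tau=0.005):
        # Critic networks: updated at every training step
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor and the three target networks: updated only every `policy_delay` steps
        if step % policy_delay == 0:
            actor_loss = -critic_1(state, actor(state)).mean()   # see the actor-update slide
            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            # Polyak (soft) update of the three target networks
            with torch.no_grad():
                for net, target in [(critic_1, target_critic_1),
                                    (critic_2, target_critic_2),
                                    (actor, target_actor)]:
                    for p, p_t in zip(net.parameters(), target.parameters()):
                        p_t.mul_(1.0 - tau).add_(tau * p)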

16 of 23


Delayed policy and target updates (3)

17 of 23


How does the actor learn? (1)

18 of 23


How does the actor learn? (2)

Fig. Networks involved in the actor update: Critic 1, Critic 2, and the Actor.
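In TD3 the actor is trained with the deterministic policy gradient, and the gradient is taken through the first critic only: the actor is updated to maximize Q_{θ1}(s, μ_φ(s)). A minimal PyTorch-style sketch, where actor, critic_1, and the optimizer are assumed modules:

    def actor_loss(actor, critic_1, state_batch):
        # Maximizing Q1(s, mu(s)) is implemented as minimizing its negative batch mean.
        action = actor(state_batch)
        return -critic_1(state_batch, action).mean()

    # Typical usage inside the training loop (every `policy_delay` steps):
    #   loss = actor_loss(actor, critic_1, state_batch)
    #   actor_opt.zero_grad(); loss.backward(); actor_opt.step()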

19 of 23


How does the agent explore the environment? (1)

20 of 23


How does the agent explore the environment? (2)
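Because the learned policy is deterministic, TD3 explores by adding zero-mean Gaussian noise to the chosen action when interacting with the environment (the paper uses σ = 0.1); this exploration noise is separate from the clipped noise used for target policy smoothing. A minimal sketch, assuming the actor returns a tensor and the action range is [-1, 1]:

    import torch

    def explore(actor, state, sigma=0.1, act_low=-1.0, act_high=1.0):
        # Deterministic action from the current policy, perturbed by Gaussian exploration noise
        with torch.no_grad():
            action = actor(state)
            noise = sigma * torch.randn_like(action)
            return torch.clamp(action + noise, act_low, act_high)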

21 of 23


The TD3 Algorithm
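The algorithm slide itself did not survive extraction; as a reference, the sketch below ties the previous pieces together into one TD3 update step. It is a PyTorch-style sketch under stated assumptions (the module, optimizer, and replay-buffer interfaces are hypothetical), with hyperparameter defaults taken from Fujimoto et al., 2018.

    import torch
    import torch.nn.functional as F

    def td3_update(step, replay_buffer, actor, critic_1, critic_2,
                   target_actor, target_critic_1, target_critic_2,
                   actor_opt, critic_opt, batch_size=100, gamma=0.99,
                   sigma=0.2, noise_clip=0.5, policy_delay=2, tau=0.005,
                   act_low=-1.0, act_high=1.0):
        state, action, reward, next_state, done = replay_buffer.sample(batch_size)

        # TD target: target policy smoothing + clipped double Q-learning
        with torch.no_grad():
            noise = (sigma * torch.randn_like(action)).clamp(-noise_clip, noise_clip)
            next_action = (target_actor(next_state) + noise).clamp(act_low, act_high)
            q_next = torch.min(target_critic_1(next_state, next_action),
                               target_critic_2(next_state, next_action))
            y = reward + gamma * (1.0 - done) * q_next

        # Critic update (every step): both critics regress toward the same target
        critic_loss = F.mse_loss(critic_1(state, action), y) + F.mse_loss(critic_2(state, action), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Delayed actor and target-network updates
        if step % policy_delay == 0:
            actor_loss = -critic_1(state, actor(state)).mean()
            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            with torch.no_grad():
                for net, target in [(critic_1, target_critic_1),
                                    (critic_2, target_critic_2),
                                    (actor, target_actor)]:
                    for p, p_t in zip(net.parameters(), target.parameters()):
                        p_t.mul_(1.0 - tau).add_(tau * p)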

22 of 23


+s, -s and summary

+s:

    • Reduces overestimation bias (thanks to twin critics + clipped double Q-learning)
    • Stabilizes training (thanks to delayed actor updates)
    • Improves robustness (thanks to target policy smoothing)
    • Outperforms DDPG on continuous control benchmarks
  • TD3 is widely used in benchmark environments such as MuJoCo (a physics engine for multi-joint robots).
  • TD3 is a foundation for more advanced RL methods.

-s:

    • Computationally more expensive (two critics)
    • Still sensitive to hyperparameters
    • Can be sample inefficient compared to model-based methods
    • Only applicable to continuous action spaces

23 of 23

References (utilized for preparation of lecture notes or MATLAB code)
