Lecture 8: Twin-Delayed DDPG (TD3)
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 642 - Advanced Reinforcement Learning
List of contents for this lecture
Relevant readings for this lecture
What is TD3?
Motivation for TD3: Why is plain DDPG problematic?
TD3 introduces three tricks to fix the issues in DDPG
1. Clipped double Q-learning with twin critics, which counteracts the optimistic bias (overestimation) of a single Q-value estimate.
2. Target policy smoothing, which adds clipped noise to the next action when computing the target.
3. Delayed policy and target updates, so that the critics are updated more often than the actor and the target networks.
TD3 architecture overview (1)
Fig. TD3 uses six networks: two critics (Critic 1, Critic 2), one actor (Actor), and their corresponding target networks (Target Critic 1, Target Critic 2, Target Actor).
TD3 architecture overview (2)
Key operations
Three key operations: (1) twin critics with clipped double Q-learning, (2) target policy smoothing, and (3) delayed policy and target updates.
Twin critics + clipped double Q-Learning
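A minimal sketch of clipped double Q-learning, assuming PyTorch-style target critics target_critic1 and target_critic2, a reward r, next state s2, next action a2 (produced by the target actor), terminal flag done, and discount gamma (all names are illustrative, not from the slides): the TD target uses the minimum of the two target critics' estimates, which suppresses the optimistic bias of a single critic.

```python
import torch

def clipped_double_q_target(r, s2, a2, done, target_critic1, target_critic2, gamma=0.99):
    """TD target using the minimum of the two target critics (clipped double Q-learning)."""
    with torch.no_grad():
        q1 = target_critic1(s2, a2)           # Q'_1(s', a')
        q2 = target_critic2(s2, a2)           # Q'_2(s', a')
        q_min = torch.min(q1, q2)             # pessimistic estimate counters overestimation
        y = r + gamma * (1.0 - done) * q_min  # no bootstrapping beyond terminal states
    return y
```

Both critics are then regressed toward the same target y, e.g. with an MSE loss.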
Target policy smoothing (1)
Target policy smoothing (2)
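A minimal sketch of target policy smoothing, assuming a target actor target_actor, action bounds a_low/a_high, a noise scale sigma, and a noise clip c (illustrative names and default values, not from the slides): Gaussian noise is added to the target actor's action and clipped, so the target Q-value is effectively smoothed over a small neighborhood of actions.

```python
import torch

def smoothed_target_action(s2, target_actor, a_low, a_high, sigma=0.2, c=0.5):
    """Next action for the TD target: target policy output plus clipped Gaussian noise."""
    with torch.no_grad():
        a2 = target_actor(s2)                                # mu'(s')
        noise = (torch.randn_like(a2) * sigma).clamp(-c, c)  # clip(eps, -c, c), eps ~ N(0, sigma^2)
        a2 = (a2 + noise).clamp(a_low, a_high)               # keep the action within valid bounds
    return a2
```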
Clipped double Q-learning under target policy smoothing
Fig. Clipped double Q-learning under target policy smoothing: the Target Actor produces the smoothed next action, Target Critic 1 and Target Critic 2 both evaluate it, and the minimum (Min) of the two Q-values is used to form the TD-Target.
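Putting the two sketches above together, the full TD-target computation shown in the figure can be written compactly (same illustrative names as before):

```python
def td3_td_target(r, s2, done, target_actor, target_critic1, target_critic2,
                  a_low, a_high, gamma=0.99, sigma=0.2, c=0.5):
    """TD target combining target policy smoothing with clipped double Q-learning."""
    a2 = smoothed_target_action(s2, target_actor, a_low, a_high, sigma, c)
    return clipped_double_q_target(r, s2, a2, done, target_critic1, target_critic2, gamma)
```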
Delayed policy and target updates (1)
While target networks are also used in DDPG, the authors of TD3 note that the interplay between the actor and the critic in such settings can itself be a source of failure. As stated in the paper:
Delayed policy and target updates (2)
Proposed solution: Update the two critic networks more frequently than the actor and the three target networks:
Fig. The two critic networks are more frequently updated than the actor and the three target networks.
As per the paper, the critics are updated at every step, while the actor and the targets are updated every second step.
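A minimal sketch of the delayed update schedule (illustrative names, not from the slides): the critics are updated at every step, while the actor and the three target networks are updated only every second step, with the targets refreshed by Polyak averaging.

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target (tau value is illustrative)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

POLICY_DELAY = 2  # as per the paper: actor and targets are updated every second step

# Inside the training loop (critic_loss / actor_loss computed from the replay batch):
#   critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()          # every step
#   if step % POLICY_DELAY == 0:
#       actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()         # delayed actor update
#       soft_update(target_actor, actor)
#       soft_update(target_critic1, critic1)
#       soft_update(target_critic2, critic2)
```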
Delayed policy and target updates (3)
How does the actor learn? (1)
How does the actor learn? (2)
Fig. Actor update: the Actor is trained by gradient ascent on Critic 1's Q-value of its own actions; Critic 2 is not used in the actor update.
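A minimal sketch of the actor objective under the assumptions above (illustrative names; following the TD3 convention of using only the first critic): the actor is updated with the deterministic policy gradient, i.e. by maximizing Critic 1's value of the actor's own actions.

```python
def actor_loss(s, actor, critic1):
    """Deterministic policy gradient objective: maximize Q1(s, mu(s)), i.e. minimize its negative."""
    return -critic1(s, actor(s)).mean()

# usage sketch: actor_opt.zero_grad(); actor_loss(s, actor, critic1).backward(); actor_opt.step()
```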
How does the agent explore the environment? (1)
How does the agent explore the environment? (2)
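A minimal sketch of exploration, assuming action bounds a_low/a_high and an exploration noise scale expl_sigma (illustrative value): since the policy is deterministic, the agent explores by adding uncorrelated Gaussian noise to the actor's action and clipping to the valid action range.

```python
import torch

def explore_action(s, actor, a_low, a_high, expl_sigma=0.1):
    """Behavior action during training: deterministic policy output plus Gaussian exploration noise."""
    with torch.no_grad():
        a = actor(s)                              # mu(s)
        a = a + torch.randn_like(a) * expl_sigma  # eps ~ N(0, expl_sigma^2)
        return a.clamp(a_low, a_high)             # keep the action within the action space
```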
The TD3 Algorithm
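A condensed sketch of one TD3 training iteration, tying together the pieces defined in the earlier sketches (td3_td_target, actor_loss, soft_update); the replay buffer, optimizers, and network definitions are assumed and all names are illustrative, not a complete implementation.

```python
import torch
import torch.nn.functional as F

def td3_train_step(step, buffer, actor, critic1, critic2,
                   target_actor, target_critic1, target_critic2,
                   actor_opt, critic_opt, a_low, a_high, gamma=0.99):
    """One TD3 training iteration (a sketch under the stated assumptions)."""
    s, a, r, s2, done = buffer.sample()  # hypothetical replay-buffer interface

    # 1) Critic update (every step): regress both critics toward the clipped, smoothed target.
    y = td3_td_target(r, s2, done, target_actor, target_critic1, target_critic2,
                      a_low, a_high, gamma)
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 2) Delayed actor and target updates (every second step, as in the paper).
    if step % 2 == 0:
        loss = actor_loss(s, actor, critic1)
        actor_opt.zero_grad(); loss.backward(); actor_opt.step()
        soft_update(target_actor, actor)
        soft_update(target_critic1, critic1)
        soft_update(target_critic2, critic2)
```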
+s, -s and summary
+s:
-s:
References (utilized for preparation of lecture notes or MATLAB code)