Lecture 10: Soft Actor-Critic (SAC)
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 642 - Advanced Reinforcement Learning
List of contents for this lecture
Relevant readings/videos for this lecture
What is SAC? (1)
(This slide is adapted from: http://chronos.isir.upmc.fr/~sigaud/teach/sac.pdf)
What is SAC? (2)
Entropy regularization encourages more exploration, which can accelerate learning and help the policy avoid premature convergence to a poor local optimum.
Key terms and equations (1)
Key terms and equations (2)
Key terms and equations (3)
Key terms and equations (4)
SAC components
Diagram: SAC maintains five networks - an Actor (policy), two Critics (Critic 1, Critic 2), and two Target Critics (Target Critic 1, Target Critic 2); a minimal sketch of these components is given below.
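A minimal sketch of these five components as PyTorch modules (the mlp helper and all dimensions below are illustrative assumptions, not the course's reference code):

import copy
import torch.nn as nn

def mlp(sizes):
    # Small helper: fully connected network with ReLU hidden activations.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

obs_dim, act_dim, hidden = 17, 6, 256                     # hypothetical dimensions
actor   = mlp([obs_dim, hidden, hidden, 2 * act_dim])     # outputs mean and log-std of pi(.|s)
critic1 = mlp([obs_dim + act_dim, hidden, hidden, 1])     # Q1(s, a)
critic2 = mlp([obs_dim + act_dim, hidden, hidden, 1])     # Q2(s, a)
targ_critic1 = copy.deepcopy(critic1)                     # target critics: copies of the critics,
targ_critic2 = copy.deepcopy(critic2)                     # slowly updated by Polyak averaging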
Versions of SAC
1- One that uses a fixed entropy-regularization coefficient.
2- Another that enforces an entropy constraint by adapting the entropy-regularization coefficient over the course of training.
For simplicity, we will use the version with a fixed entropy regularization coefficient, although the entropy-constrained variant is generally preferred in practice.
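Both versions maximize the same entropy-regularized return and differ only in how the temperature α is chosen. For reference, a standard way of writing this objective (notation assumed to match the earlier "Key terms and equations" slides) is

J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t, s_{t+1}) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]

where α is a fixed hyperparameter in the first version, and in the second version it is adapted during training so that the policy entropy stays near a target value.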
Learning Q (1)
The Q-functions are learned in a similar way to TD3, but with a few key differences.
First, what’s similar?
What’s different?
Unlike TD3, SAC does not add target-policy smoothing noise: the learned policy is already stochastic, and so the noise from that stochasticity is sufficient to get a similar effect.
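To make the critic update concrete, here is a minimal PyTorch sketch of the target computation with a fixed temperature (the policy and target-critic interfaces and the batch layout are assumptions for illustration, not the reference implementation):

import torch

def critic_targets(batch, policy, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    # Assumed (hypothetical) interfaces:
    #   policy(s)      -> (action, log_prob) sampled from the current pi(.|s)
    #   q*_targ(s, a)  -> Q-value estimates from the target critics
    #   batch          -> dict of tensors 'rew', 'obs2', 'done' from the replay buffer
    r, s2, d = batch['rew'], batch['obs2'], batch['done']
    with torch.no_grad():
        a2, logp_a2 = policy(s2)              # a' ~ pi(.|s'): current policy, not a target policy
        q_min = torch.min(q1_targ(s2, a2),    # clipped double-Q, as in TD3
                          q2_targ(s2, a2))
        # the entropy term enters the target as -alpha * log pi(a'|s')
        return r + gamma * (1 - d) * (q_min - alpha * logp_a2)

Both Q-networks are then regressed onto this same target with a mean-squared-error loss, and the target critics are updated by Polyak averaging.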
Learning Q (2)
Learning Q (3)
Learning the policy (1)
Learning the policy (2)
Learning the policy (3)
Learning the policy (4)
Automatic entropy adjustment: learning the temperature parameter in SAC
(Lagrangian form of the constrained problem)
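A minimal PyTorch sketch of this dual update (the log_alpha parameterization, the target-entropy heuristic, and the optimizer settings are illustrative assumptions):

import torch

act_dim = 6                                    # hypothetical action dimension
target_entropy = -float(act_dim)               # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(logp_a):
    # Lagrangian update: minimize J(alpha) = E[ -alpha * (log pi(a|s) + target_entropy) ]
    alpha_loss = -(log_alpha.exp() * (logp_a.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()              # alpha value used in the actor and critic losses

When the policy's entropy falls below the target, this update pushes α up (more exploration pressure); when it is above the target, α shrinks.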
Exploration and exploitation in SAC
SAC trains a stochastic policy with entropy regularization. It explores by sampling actions from this policy, while learning off-policy from a replay buffer.
At test time, to see how well the policy exploits what it has learned, we remove stochasticity and use the mean action instead of a sample from the distribution.
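A minimal sketch of this train-time versus test-time action selection for a squashed-Gaussian policy (mu and log_std are assumed to be the actor network's outputs for one state):

import torch

def select_action(mu, log_std, deterministic=False):
    # Training (exploration): sample a ~ tanh(Normal(mu, std)) from the stochastic policy.
    # Evaluation (exploitation): drop the noise and return the squashed mean action.
    if deterministic:
        return torch.tanh(mu)
    std = log_std.exp()
    a = torch.distributions.Normal(mu, std).rsample()   # reparameterized sample
    return torch.tanh(a)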
SAC algorithm
Pros (+s) and cons (-s)
+s:
SAC performance comparisons with some state-of-the-art DRL methods
Summary
References (utilized for preparation of lecture notes or MATLAB code)