1 of 25

Lecture 10: Soft Actor-Critic (SAC)


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 642 - Advanced Reinforcement Learning

2 of 25


List of contents for this lecture

  • Motivations and ideas behind SAC

  • Math behind SAC

  • The SAC Algorithm

3 of 25


Relevant readings/videos for this lecture

  • Chapter 12 of Miguel Morales, “Grokking Deep Reinforcement Learning”, Manning, 2020.

4 of 25


What is SAC? (1)

(This slide is derived from: http://chronos.isir.upmc.fr/~sigaud/teach/sac.pdf )

5 of 25


What is SAC? (2)

  • SAC is an algorithm that optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches.
  • A central feature of SAC is entropy regularization (what “soft” refers to in SAC). The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy (see the objective sketched after this list).
  • This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which:
    • can accelerate learning later on.
    • can also prevent the policy from prematurely converging to a bad local optimum.
  • Published concurrently with TD3 by Haarnoja et al., 2018: T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, https://arxiv.org/abs/1801.01290.
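As a minimal sketch of the entropy-regularized objective (notation follows the SAC paper; α is the temperature/entropy-regularization coefficient):

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t, s_{t+1}) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[-\log \pi(a \mid s)\big].
```

Setting α = 0 recovers the standard return-maximization objective; a larger α puts more weight on keeping the policy random.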

6 of 25


Key terms and equations (1)
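As a sketch of the “soft” (entropy-regularized) value functions that the following equations build on (standard definitions from the SAC literature; the slides’ notation may differ slightly):

```latex
Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t,s_{t+1})
              + \alpha \sum_{t=1}^{\infty}\gamma^{t}\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\;\middle|\; s_0 = s,\ a_0 = a\right],

V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\big[Q^{\pi}(s,a)\big] + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s)\big)
           = \mathbb{E}_{a \sim \pi}\big[Q^{\pi}(s,a) - \alpha \log \pi(a \mid s)\big].
```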

7 of 25


Key terms and equations (2)

8 of 25


Key terms and equations (3)

9 of 25


Key terms and equations (4)

10 of 25


SAC components

[Diagram] Five networks: an actor (the policy), two critics (Critic 1 and Critic 2), and their Polyak-averaged copies (Target Critic 1 and Target Critic 2).
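A minimal PyTorch-style sketch of how these five components are typically set up (class names, sizes, and dimensions below are illustrative assumptions, not the course code):

```python
import copy
import torch
import torch.nn as nn

class MLPCritic(nn.Module):
    """Illustrative Q-network: maps (state, action) to a scalar Q-value."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1)).squeeze(-1)

obs_dim, act_dim = 17, 6                      # illustrative dimensions

critic1 = MLPCritic(obs_dim, act_dim)         # Critic 1
critic2 = MLPCritic(obs_dim, act_dim)         # Critic 2
target_critic1 = copy.deepcopy(critic1)       # Target Critic 1 (Polyak-averaged copy)
target_critic2 = copy.deepcopy(critic2)       # Target Critic 2
for p in list(target_critic1.parameters()) + list(target_critic2.parameters()):
    p.requires_grad = False                   # targets change only via Polyak averaging

# The actor is a stochastic (squashed-Gaussian) policy network; its sampling step
# is sketched with the reparameterization example later in the lecture.
```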

11 of 25


Versions of SAC

  • The version of SAC presented in this lecture can only be used for environments with continuous action spaces.
  • Alternative SAC variants, with slightly modified policy-update rules, can be used to handle discrete action spaces (please see the SAC literature or web for discrete-action extensions).
  • For continuous action spaces, two SAC variants are commonly used:

    1. One that uses a fixed entropy-regularization coefficient.

    2. One that enforces an entropy constraint by adapting the entropy-regularization coefficient over the course of training.

For simplicity, we will use the version with a fixed entropy-regularization coefficient, although the entropy-constrained variant is generally preferred in practice (the constrained formulation is sketched below).
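As a sketch, the entropy-constrained variant can be written as a constrained problem: maximize the expected return subject to the policy's expected entropy staying above a target level H̄ (in the literature H̄ is commonly set to −dim(A) for continuous action spaces):

```latex
\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} \gamma^{t} r(s_t,a_t,s_{t+1})\right]
\quad \text{subject to} \quad
\mathbb{E}_{(s_t,a_t) \sim \pi}\big[-\log \pi(a_t \mid s_t)\big] \;\ge\; \bar{\mathcal{H}} \quad \forall t.
```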

 

12 of 25


Learning Q (1)

The Q-functions are learned in a similar way to TD3, but with a few key differences.

First, what’s similar?

  1. Like in TD3, both Q-functions are learned by minimizing the MSBE (mean-squared Bellman error), regressing to a single shared target.

  2. Like in TD3, the shared target is computed using target Q-networks, and the target Q-networks are obtained by Polyak-averaging the Q-network parameters over the course of training.

  3. Like in TD3, the shared target makes use of the clipped double-Q trick.

What’s different?

  1. Unlike in TD3, the target also includes a term that comes from SAC’s use of entropy regularization (see the target sketched after this list).

  2. Unlike in TD3, the next-state actions used in the target come from the current policy instead of a target policy.

  3. Unlike in TD3, there is no explicit target-policy smoothing. TD3 trains a deterministic policy, so it accomplishes smoothing by adding random noise to the next-state actions. SAC trains a stochastic policy, so the noise from that stochasticity is sufficient to get a similar effect.
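Putting these differences together, the shared target and the resulting critic losses can be sketched as follows (notation as in the SAC paper; D is the replay buffer and d the done flag):

```latex
y(r,s',d) = r + \gamma (1-d)\left(\min_{j=1,2} Q_{\phi_{\mathrm{targ},j}}(s',\tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}' \mid s')\right),
\qquad \tilde{a}' \sim \pi_{\theta}(\cdot \mid s'),

L(\phi_i) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\Big[\big(Q_{\phi_i}(s,a) - y(r,s',d)\big)^{2}\Big], \qquad i = 1,2.
```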

13 of 25


Learning Q (2)

14 of 25


Learning Q (3)

15 of 25


Learning the policy (1)
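As a sketch of the standard SAC policy update covered in this part (notation follows the SAC paper and may differ slightly from the slides): actions are drawn via a reparameterized, tanh-squashed Gaussian so that gradients flow through the sampling step, and the policy maximizes the entropy-regularized clipped double-Q value:

```latex
\tilde{a}_{\theta}(s,\xi) = \tanh\!\big(\mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi\big), \qquad \xi \sim \mathcal{N}(0, I),

\max_{\theta}\ \mathbb{E}_{s \sim \mathcal{D},\,\xi \sim \mathcal{N}}\left[\min_{j=1,2} Q_{\phi_j}\big(s,\tilde{a}_{\theta}(s,\xi)\big) - \alpha \log \pi_{\theta}\big(\tilde{a}_{\theta}(s,\xi) \mid s\big)\right].
```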

16 of 25


Learning the policy (2)

17 of 25


Learning the policy (3)

18 of 25


Learning the policy (4)

19 of 25


Automatic entropy adjustment: learning the temperature parameter in SAC

(Lagrangian form of the constrained problem)
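A sketch of the resulting temperature update (standard formulation from Haarnoja et al.): the coefficient α acts as a Lagrange multiplier and is adjusted by gradient descent on

```latex
J(\alpha) = \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi_{\theta}(\cdot \mid s)}\Big[-\alpha\big(\log \pi_{\theta}(a \mid s) + \bar{\mathcal{H}}\big)\Big].
```

Intuitively, when the policy's entropy falls below the target H̄, the gradient step increases α (exploration is rewarded more); when the entropy exceeds H̄, α is decreased.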

20 of 25


Exploration and exploitation in SAC

SAC trains a stochastic policy with entropy regularization. It explores by sampling actions from this policy, while learning off-policy from a replay buffer.

    • The entropy-regularization coefficient explicitly controls the explore-exploit trade-off, with a higher coefficient corresponding to more exploration and a lower coefficient corresponding to more exploitation.

    • The right coefficient (the one which leads to the most stable and highest-reward learning) may vary from environment to environment and can require careful tuning.

At test time, to see how well the policy exploits what it has learned, we remove stochasticity and use the mean action instead of a sample from the distribution (a minimal sketch of both modes follows below).

    • This tends to improve performance over the original stochastic policy.
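A minimal sketch of the two action-selection modes for a squashed-Gaussian SAC policy; `mu` and `log_std` are assumed to be the policy network's outputs for a single state:

```python
import torch

def select_action(mu, log_std, deterministic=False):
    """Return a tanh-squashed action: stochastic for training, mean for evaluation."""
    if deterministic:
        pre_tanh = mu                                # test time: exploit the learned mean action
    else:
        std = log_std.exp()
        pre_tanh = mu + std * torch.randn_like(std)  # training time: sample for exploration
    return torch.tanh(pre_tanh)                      # squash into the bounded range (-1, 1)

# Example with made-up policy outputs:
mu = torch.tensor([0.3, -1.2])
log_std = torch.tensor([-0.5, -0.5])
explore_action = select_action(mu, log_std)                      # used when collecting data
exploit_action = select_action(mu, log_std, deterministic=True)  # used at evaluation time
```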

21 of 25


SAC algorithm
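As a hedged companion to the algorithm on this slide, here is a compact sketch of one SAC gradient step for the fixed-temperature variant (names such as `actor.sample`, `critic1`, and the batch layout are illustrative assumptions, not a specific library API):

```python
import torch

def sac_update(batch, actor, critic1, critic2, targ1, targ2,
               critic_opt, actor_opt, gamma=0.99, alpha=0.2, polyak=0.995):
    s, a, r, s2, d = batch  # transitions sampled from the replay buffer (illustrative layout)

    # Critic update: regress both Q-networks to the shared entropy-regularized target.
    with torch.no_grad():
        a2, logp_a2 = actor.sample(s2)                    # next actions from the current policy
        q_targ = torch.min(targ1(s2, a2), targ2(s2, a2))  # clipped double-Q
        y = r + gamma * (1 - d) * (q_targ - alpha * logp_a2)
    critic_loss = ((critic1(s, a) - y) ** 2).mean() + ((critic2(s, a) - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize min-Q plus the entropy bonus, using a reparameterized sample.
    # (A full implementation usually freezes the Q-network parameters during this step.)
    a_pi, logp_pi = actor.sample(s)
    q_pi = torch.min(critic1(s, a_pi), critic2(s, a_pi))
    actor_loss = (alpha * logp_pi - q_pi).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak-average the target critics toward the current critics.
    with torch.no_grad():
        for net, targ in ((critic1, targ1), (critic2, targ2)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(polyak).add_((1 - polyak) * p)
```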

22 of 25


+s, -s

+s:

    • Sample Efficiency: Off-policy learning enables reuse of past experiences, improving data efficiency.
    • Stability and Robustness: Twin Q-networks and entropy regularization reduce overestimation and encourage exploration.
    • Exploration-Exploitation Balance: Entropy term ensures continual exploration, avoiding premature convergence.
    • Reparameterization Trick: Enables low-variance gradient estimates for stochastic policies (Why? See the sketch after this list).
    • Continuous Action Space: Well-suited for high-dimensional, continuous control tasks (e.g., robotics).
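On the reparameterization point above: the action is written as a deterministic function of the network outputs plus independent noise, so gradients can pass through the sampling step with low variance. A minimal sketch of the tanh-squashed Gaussian used in SAC (the log-probability correction follows the change-of-variables formula in the SAC paper):

```python
import torch

def reparameterized_action(mu, log_std):
    """Sample a tanh-squashed Gaussian action so gradients flow into mu and log_std."""
    std = log_std.exp()
    eps = torch.randn_like(std)      # noise drawn independently of the parameters
    pre_tanh = mu + std * eps        # differentiable w.r.t. mu and log_std
    action = torch.tanh(pre_tanh)    # squash into (-1, 1)

    # Gaussian log-density plus the tanh change-of-variables correction.
    gauss_logp = (-0.5 * ((pre_tanh - mu) / std) ** 2 - log_std
                  - 0.5 * torch.log(torch.tensor(2 * torch.pi))).sum(-1)
    logp = gauss_logp - torch.log(1 - action.pow(2) + 1e-6).sum(-1)
    return action, logp

mu = torch.zeros(2, requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)
action, logp = reparameterized_action(mu, log_std)
logp.backward()                      # gradients reach mu and log_std through the sample
print(mu.grad, log_std.grad)
```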

 

23 of 25


SAC performance comparisons with some state-of-the-art DRL methods

24 of 25


Summary

  • Vanilla SAC is an off-policy deep reinforcement learning algorithm for continuous action spaces.

  • It maximizes a trade-off between expected return and policy entropy to encourage exploration.

  • SAC is sample-efficient, stable, and effective in complex tasks, but requires careful tuning and more computation than simpler algorithms.

25 of 25

References (utilized for preparation of lecture notes or MATLAB code)
