
Lecture 14: Continuous Q-Learning

Sookyung Kim

1


Tentative Schedule

  • Week 12 - Continuous Q-learning (11/17), Offline RL 1 (11/19)
  • Week 13 - Offline RL 2 (11/24), Inverse RL (11/26)
  • Week 14 - Intrinsic Reward (Curiosity-Driven RL) (12/1), LLM+RL 1 (12/3)
  • Week 15 - LLM+RL 2 (12/8), Final Review (12/10)
  • Week 16 - Final Exam (12/15) - MCQ, covering all topics (mainly concepts from the first part)

Online Video Lecture

  • NeurIPS 2026 (12/1-9): Soo is away

2


RECAP: Q-learning

The update uses a stored transition (s, a, s′, r), and a′ doesn't have to be sampled from the current policy: off-policy RL.
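
A minimal sketch of the target this recap refers to, for the discrete-action case (the helper name and example numbers are illustrative, not from the slides): y = r + γ · max_a′ Q(s′, a′), where a′ comes from the max rather than from the policy that collected the data.

```python
import numpy as np

def q_learning_target(q_values_next, reward, gamma=0.99, done=False):
    """Off-policy Q-learning target: y = r + gamma * max_a' Q(s', a').

    The max over a' means the next action is NOT taken from the behavior
    policy that generated the transition, which is what makes this off-policy.
    """
    bootstrap = 0.0 if done else float(np.max(q_values_next))
    return reward + gamma * bootstrap

# Usage with made-up numbers: Q(s', .) over 3 discrete actions
y = q_learning_target(np.array([1.2, 0.7, 2.0]), reward=0.5)
```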

3


Recap: Problems with the online Q-learning algorithm

(1) Correlated Samples

(2) Target changes as Q changes

4


Recap: Correlated samples in online Q-learning - Experience replay buffer
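
The slide body is a figure; as a rough sketch of the idea it illustrates (class name and capacity are my own, illustrative choices), a replay buffer stores past transitions and samples random minibatches, which breaks the correlation between consecutive samples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions; sampling random minibatches breaks the temporal
    correlation of consecutive online samples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return list(s), list(a), list(r), list(s_next), list(done)
```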

5


Recap - Target Changes as Q Changes: Target Network

  • Fixed Target Network
    • The error is unstable because the target, which changes whenever the network parameters change, appears inside the error calculation.
    • To stabilize training, the target network is updated only once every 1,000 steps.

Error = Target (Bellman) − Prediction = [r + γ max_a′ Q_target(s′, a′)] − Q(s, a)
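
A minimal PyTorch sketch of this idea, assuming a small fully-connected Q-network; the layer sizes and variable names are illustrative, and the 1,000-step period is taken from the bullet above:

```python
import copy
import torch

q_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)           # frozen copy used only for the Bellman target
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done, gamma=0.99):
    """Squared error between the Bellman target (from target_net) and the prediction."""
    with torch.no_grad():                   # no gradient flows through the target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(prediction, target)

def maybe_update_target(step, period=1000):
    """Copy the online weights into the target network every `period` steps."""
    if step % period == 0:
        target_net.load_state_dict(q_net.state_dict())
```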

6


Recap - Target Changes as Q Changes: Target Network

7


Q-Learning with Continuous Actions

8


What's the problem with continuous actions?

9


Optimization after discretization of continuous actions
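
A rough sketch of what this option amounts to (the function and argument names are mine): approximate max_a Q(s, a) by evaluating Q on a grid or a random sample of actions and keeping the best one. Simple and parallelizable, but the approximation gets crude as the action dimension grows.

```python
import numpy as np

def approx_max_q(q_fn, state, action_low, action_high, num_samples=1000, rng=None):
    """Approximate max_a Q(s, a) by sampling/discretizing the action space."""
    rng = rng or np.random.default_rng()
    action_low, action_high = np.asarray(action_low), np.asarray(action_high)
    actions = rng.uniform(action_low, action_high, size=(num_samples, action_low.size))
    q_values = np.array([q_fn(state, a) for a in actions])
    best = int(np.argmax(q_values))
    return q_values[best], actions[best]   # approximate max and argmax
```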

10


Use a function class that is easy to optimize
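
The classic example of such a function class is a Q-function that is quadratic in the action (the NAF-style form); whether this exact form is what the slide figure shows is an assumption on my part. The point is that the argmax over actions is available in closed form:

```python
import numpy as np

def quadratic_q(s, a, mu_fn, P_fn, V_fn):
    """Q(s, a) = -0.5 * (a - mu(s))^T P(s) (a - mu(s)) + V(s).

    With P(s) positive definite, argmax_a Q(s, a) = mu(s) and
    max_a Q(s, a) = V(s), so no inner optimization is needed.
    mu_fn, P_fn, V_fn stand for learned networks (hypothetical names).
    """
    mu, P, V = mu_fn(s), P_fn(s), V_fn(s)
    diff = a - mu
    return -0.5 * diff @ P @ diff + V
```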

11


Learn an approximate maximizer

12


Learn an approximate maximizer

e.g., NFQCA, TD3, SAC
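
The "approximate maximizer" is a second network μ_θ(s) trained so that μ_θ(s) ≈ argmax_a Q_φ(s, a), by doing gradient ascent on Q(s, μ_θ(s)). A minimal PyTorch sketch of that actor update (network shapes and names are illustrative):

```python
import torch

state_dim, action_dim = 8, 2
actor = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, action_dim), torch.nn.Tanh())
critic = torch.nn.Sequential(torch.nn.Linear(state_dim + action_dim, 64),
                             torch.nn.ReLU(), torch.nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(states):
    """Push mu(s) toward argmax_a Q(s, a) by maximizing Q(s, mu(s))."""
    actions = actor(states)
    q = critic(torch.cat([states, actions], dim=1))
    loss = -q.mean()                  # ascend Q by descending -Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```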

13


Summary of DDPG
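
The slide body is a figure, so here is a compressed sketch of how the pieces above fit together in one DDPG update step (replay-buffer batch, target networks, actor as approximate maximizer); the soft-update coefficient and other hyperparameters are illustrative:

```python
import torch

def ddpg_update(batch, actor, critic, actor_tgt, critic_tgt,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update on a minibatch sampled from the replay buffer."""
    s, a, r, s_next, done = batch     # float tensors

    # Critic: y = r + gamma * Q_tgt(s', mu_tgt(s')) -- mu_tgt replaces the max over a'
    with torch.no_grad():
        a_next = actor_tgt(s_next)
        y = r + gamma * (1 - done) * critic_tgt(torch.cat([s_next, a_next], 1)).squeeze(1)
    q = critic(torch.cat([s, a], 1)).squeeze(1)
    critic_loss = torch.nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) updates of the target networks
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```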

14


Summary of DDPG

15