1 of 26

Lecture 6 - RL algorithms: Policy-based RL

Sookyung Kim


2 of 26

(Recap) Basics of Reinforcement Learning

  • Reinforcement Learning:
    Input: an environment that provides a numerical reward signal, and an agent that acts inside that environment.
    Output: the agent learns how to take actions (a policy) in order to maximize reward.


  • Goal: learn how to take actions in order to maximize reward
  • Designing an RL problem: objective, state, action, reward
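A minimal sketch of this agent-environment loop. CoinFlipEnv, RandomAgent, and all numbers below are invented for illustration; they are not from the slides.

import random

class CoinFlipEnv:
    # Toy environment: a hidden coin each step; reward +1 if the action guesses it.
    def reset(self):
        self.coin = random.randint(0, 1)
        return 0                                   # single dummy state
    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randint(0, 1)           # environment moves on
        return 0, reward, False                    # (next_state, reward, done)

class RandomAgent:
    # Trivial policy pi(a|s): pick an action uniformly at random.
    def act(self, state):
        return random.randint(0, 1)

env, agent = CoinFlipEnv(), RandomAgent()
state, total_reward = env.reset(), 0.0
for t in range(100):                               # one 100-step episode
    action = agent.act(state)                      # policy maps state -> action
    state, reward, done = env.step(action)         # environment returns a reward signal
    total_reward += reward
print("episode return:", total_reward)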


3 of 26

Recap - Key Concepts in RL


4 of 26

Recap - Key Concepts in RL

  • The expected return can be conditioned in two ways:
  1. When s0 is given → value function V(s0)
  2. When (s0, a0) is given → Q-function Q(s0, a0)

We can numerically optimize the policy using the self-consistent Bellman equations.
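As a reminder (standard definitions, not reproduced from the slide), the value function, the Q-function, and their self-consistency conditions can be written as:

V^\pi(s)    = \mathbb{E}_{a \sim \pi(a \mid s)} \left[ Q^\pi(s, a) \right]
Q^\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ V^\pi(s') \right]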

 

 


5 of 26

Taxonomy of RL algorithms

[Diagram: taxonomy of RL algorithms, showing policy-based and value-based methods]


6 of 26

The goal of Reinforcement Learning


7 of 26

The goal of Reinforcement Learning
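The slide bodies did not survive extraction; a standard formulation of the goal, which the following slides build on, is to maximize the expected return over trajectories sampled from the policy:

p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)

\theta^\star = \arg\max_\theta J(\theta),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(s_t, a_t) \right]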


8 of 26

Evaluating the objective
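The usual Monte Carlo estimate of this objective (assuming we can sample N trajectories by running the current policy; not reproduced from the slide) is:

J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} r(s_{i,t}, a_{i,t})

Each inner sum over t is simply the return of the i-th sampled trajectory.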


9 of 26

Direct policy differentiation


10 of 26

Direct policy differentiation
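The standard derivation behind this slide title uses the log-derivative (likelihood-ratio) trick; the slide's own steps are not reproduced here:

\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau
= \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau
= \mathbb{E}_{\tau \sim p_\theta} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]

Because the initial-state and dynamics terms do not depend on \theta,
\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t),
so

\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta} \left[
  \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big)
  \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \right]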


11 of 26

REINFORCE – policy gradient algorithm
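A sketch of the classic REINFORCE loop (the standard three-step version; details on the slide are not reproduced):

1. Sample N trajectories {\tau^i} by running \pi_\theta(a_t \mid s_t) in the environment.
2. Estimate the gradient:
   \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
     \Big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \Big)
     \Big( \sum_t r(s_t^i, a_t^i) \Big)
3. Take a gradient-ascent step: \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta).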


12 of 26

Evaluating the policy gradient
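A minimal, runnable sketch of this estimator on a toy two-armed bandit with a softmax policy. The bandit, its reward means, and all hyperparameters are made up for illustration; they are not from the slides.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])      # hypothetical reward means for the two arms
theta = np.zeros(2)                    # policy parameters: softmax logits
alpha, N = 0.1, 16                     # step size, sampled trajectories per update

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                 # pi_theta(a): softmax over the logits

for it in range(200):
    probs = policy(theta)
    grad = np.zeros_like(theta)
    for _ in range(N):
        a = rng.choice(2, p=probs)                 # sample action from pi_theta
        r = rng.normal(true_means[a], 0.1)         # reward from the environment
        grad_log_pi = -probs                       # d log softmax / d theta ...
        grad_log_pi[a] += 1.0                      # ... = onehot(a) - probs
        grad += grad_log_pi * r                    # grad log pi(a) * return
    theta += alpha * grad / N                      # gradient-ascent step

print("learned action probabilities:", policy(theta))

With these made-up settings the policy should concentrate on the higher-reward arm, illustrating that the estimator pushes probability toward high-return actions.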


13 of 26

Comparison to maximum likelihood
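For reference (standard comparison, not copied from the slide): the maximum-likelihood gradient on the same sampled actions versus the policy gradient:

\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \Big( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big)
  \Big( \sum_t r(s_{i,t}, a_{i,t}) \Big)

The policy gradient is a return-weighted version of the maximum-likelihood gradient: high-return trajectories are made more likely, low-return ones less likely.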


14 of 26

Example: Gaussian policies
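A standard instantiation (assumed here, since the slide body is missing): a Gaussian policy whose mean is a network f_\theta(s) with fixed covariance \Sigma:

\pi_\theta(a \mid s) = \mathcal{N}\big(f_\theta(s), \Sigma\big)

\log \pi_\theta(a \mid s) = -\tfrac{1}{2} \big(f_\theta(s) - a\big)^{\top} \Sigma^{-1} \big(f_\theta(s) - a\big) + \text{const}

\nabla_\theta \log \pi_\theta(a \mid s) = -\big(f_\theta(s) - a\big)^{\top} \Sigma^{-1} \, \frac{\partial f_\theta(s)}{\partial \theta}

Plugging this gradient into the REINFORCE estimator gives a concrete policy gradient for continuous actions.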


15 of 26

What did we just do?


16 of 26

Partial observability


17 of 26

Problem of Policy Gradient


18 of 26

Problem of Policy Gradient - LLM post-training


19 of 26

Problem of Policy Gradient - LLM post-training


20 of 26

Problem of Policy Gradient - LLM post-training


21 of 26

Problem of Policy Gradient - LLM post-training


22 of 26

Problem of Policy Gradient - LLM post-training

https://grok.com/ani
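For context (a standard way to cast LLM post-training as policy gradient; the specific content of these slides is not reproduced): the prompt x is the initial state, each generated token y_t is an action, and the reward R(x, y) is typically a single scalar for the whole response, so the same high-variance estimator appears:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \Big( \sum_{t} \nabla_\theta \log \pi_\theta\big(y_{i,t} \mid x_i, y_{i,<t}\big) \Big) \, R(x_i, y_i)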


23 of 26

Reducing variance
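One standard variance-reduction step (the usual causality / reward-to-go argument; assumed here, not copied from the slide): an action taken at time t cannot affect rewards earned before t, so each log-probability term is weighted only by the reward-to-go:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})
  \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)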


24 of 26

Baselines
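The standard baseline trick (not reproduced from the slide): subtract a constant b from the return; a simple choice is the average sampled return. The estimator stays unbiased because the extra term has zero expectation:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_\theta \log p_\theta(\tau_i) \, \big( r(\tau_i) - b \big),
\qquad b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)

\mathbb{E}_{\tau \sim p_\theta} \left[ \nabla_\theta \log p_\theta(\tau) \, b \right]
= b \int \nabla_\theta p_\theta(\tau) \, d\tau
= b \, \nabla_\theta \int p_\theta(\tau) \, d\tau
= b \, \nabla_\theta 1 = 0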


25 of 26

Analyzing variance
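For reference, the result usually derived under this heading (standard material, not copied from the slide): writing the per-dimension variance of the baselined estimator and setting its derivative with respect to b to zero gives an optimal baseline of expected reward weighted by squared gradient magnitude:

\mathrm{Var} = \mathbb{E}_{\tau \sim p_\theta}\big[ \big( g(\tau) \, (r(\tau) - b) \big)^2 \big]
             - \mathbb{E}_{\tau \sim p_\theta}\big[ g(\tau) \, (r(\tau) - b) \big]^2,
\qquad g(\tau) = \nabla_\theta \log p_\theta(\tau)

\frac{d \, \mathrm{Var}}{db} = 0
\;\Rightarrow\;
b^{\star} = \frac{\mathbb{E}\big[ g(\tau)^2 \, r(\tau) \big]}{\mathbb{E}\big[ g(\tau)^2 \big]}

The second expectation above is just \nabla_\theta J(\theta), which does not depend on b, so only the first term matters for the minimization.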


26 of 26

Review
