1 of 14

Lecture 7. Off-policy Policy Gradient

Sookyung Kim

2 of 14

Taxonomy of RL algorithms

3 of 14

Off-policy vs On-policy

Policy Gradient

Q-learning

4 of 14

Off-policy vs On-policy

5 of 14

Policy Gradient is On-policy

6 of 14

Off-policy learning & importance sampling

q(x)
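
To make the idea concrete, here is a minimal sketch of importance sampling: an expectation under a target distribution p(x) is estimated from samples drawn from a different distribution q(x), with each sample reweighted by p(x)/q(x). The two Gaussians, the function f(x) = x, and all helper names are assumptions chosen for illustration; they are not from the lecture.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_mean(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[f(x)] using n samples drawn from q instead of p."""
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)                 # sample from the proposal q
        weight = p_pdf(x) / q_pdf(x)       # importance weight p(x) / q(x)
        total += weight * f(x)
    return total / n


rng = random.Random(0)
# Assumed setup: target p = N(1, 1), proposal q = N(0, 2^2), f(x) = x,
# so the true value of E_p[f(x)] is 1.
estimate = importance_sampling_mean(
    f=lambda x: x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda r: r.gauss(0.0, 2.0),
    n=50_000,
    rng=rng,
)
print(estimate)  # should be close to 1
```

The proposal is deliberately wider than the target so that the weights stay bounded; a proposal that rarely covers regions where p(x) is large would give a much noisier estimate.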

7 of 14

Off-policy learning & importance sampling

q(x)

Come from the environment
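
As a worked step, the trajectory importance weight can be written out as below. The notation is an assumption, since the slide's equations did not survive extraction: \pi_\theta is the policy being evaluated, \bar{\pi} is the policy that generated the samples, and T is the horizon. The initial-state and transition terms belong to the environment, appear in both numerator and denominator, and cancel.

```latex
\frac{p_\theta(\tau)}{\bar{p}(\tau)}
  = \frac{p(s_1)\,\prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
         {p(s_1)\,\prod_{t=1}^{T} \bar{\pi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
  = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\bar{\pi}(a_t \mid s_t)}
```

Only the ratio of action probabilities remains, so the weight can be computed without knowing the environment dynamics.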

8 of 14

Deriving policy gradient with importance sampling
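
A sketch of the standard derivation follows, under assumed notation (the slide's own equations are missing from this text): \theta are the parameters of the policy that collected the trajectories, \theta' are the parameters being optimized, and r(\tau) is the total reward of trajectory \tau.

```latex
J(\theta') = E_{\tau \sim p_\theta(\tau)}\!\left[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau) \right]

\nabla_{\theta'} J(\theta')
  = E_{\tau \sim p_\theta(\tau)}\!\left[ \frac{\nabla_{\theta'} p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau) \right]
  = E_{\tau \sim p_\theta(\tau)}\!\left[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\,
      \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau) \right]
```

The last step uses the identity \nabla p = p \nabla \log p. Setting \theta' = \theta makes the weight equal to 1 and recovers the usual on-policy gradient.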

9 of 14

The off-policy policy gradient

10 of 14

The off-policy policy gradient

<1

Sometimes works in practice, when the policy used for importance sampling has the same state distribution as the current policy.
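
The "<1" annotation presumably refers to the per-step probability ratios in the trajectory weight. As a rough illustration, the sketch below multiplies a constant per-step ratio over increasing horizons to show how quickly the full product shrinks. The value 0.9 and the horizons are assumptions picked for the example, not figures from the lecture.

```python
import math


def trajectory_weight(step_ratios):
    """Product of per-step ratios pi_new(a_t|s_t) / pi_old(a_t|s_t)."""
    return math.prod(step_ratios)


# Assumed per-step ratio: every step is slightly below 1.
STEP_RATIO = 0.9

weights = {}
for horizon in (1, 10, 100):
    weights[horizon] = trajectory_weight([STEP_RATIO] * horizon)
    print(horizon, weights[horizon])
```

Because the product decays exponentially with the horizon, long trajectories contribute almost nothing to the estimate, which is one motivation for approximations that avoid the full product.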

11 of 14

Implementing policy gradient: TensorFlow

12 of 14

Implementing policy gradient: automatic differentiation

13 of 14

Implementing policy gradient: automatic differentiation
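
One common way to use automatic differentiation here is to build a "pseudo-loss", the return-weighted log-likelihood of the sampled actions, whose gradient equals the policy gradient estimate. The sketch below checks that claim numerically for a small softmax policy. It is a library-free illustration: finite differences stand in for the autodiff framework, the policy has no state input, and the sample data and function names are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pseudo_loss(logits, actions, returns):
    """Mean of log pi(a_i) * R_i, with returns treated as constants."""
    probs = softmax(logits)
    return sum(math.log(probs[a]) * r for a, r in zip(actions, returns)) / len(actions)


def policy_gradient(logits, actions, returns):
    """Mean of grad log pi(a_i) * R_i, using the closed form for softmax."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, returns):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[k == a] - pi_k
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * r
    return [g / len(actions) for g in grad]


def finite_difference(f, x, eps=1e-6):
    """Central-difference gradient of f at x (stand-in for autodiff)."""
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


# Made-up samples: three actions, four (action, return) pairs.
logits = [0.2, -0.5, 1.0]
actions = [0, 2, 2, 1]
returns = [1.0, 0.5, -2.0, 3.0]

analytic = policy_gradient(logits, actions, returns)
numeric = finite_difference(lambda th: pseudo_loss(th, actions, returns), logits)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places. In a real framework the same pseudo-loss would be handed to the autodiff engine, usually negated so that a minimizer performs gradient ascent on the return.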

14 of 14

Implementing policy gradient: In practice

REINFORCE using TF:
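
The slide's code did not survive extraction, so the following is only a minimal sketch of what a REINFORCE update might look like, not the lecture's original listing. It assumes TensorFlow 2.x with eager execution, a discrete action space, and a small two-layer policy network. The sizes, the optimizer settings, the `reinforce_step` name, and the random batch are all hypothetical placeholders.

```python
import numpy as np
import tensorflow as tf

# Hypothetical problem sizes, chosen only for illustration.
OBS_DIM, N_ACTIONS = 4, 2

# Small policy network mapping a state to action logits.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(N_ACTIONS),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)


def reinforce_step(states, actions, returns):
    """One gradient step on a batch of (state, action, return) samples."""
    with tf.GradientTape() as tape:
        logits = policy(states)
        # Cross-entropy with the taken action as the label is -log pi(a|s).
        neg_logp = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=logits)
        # Returns are plain constants here, so no gradient flows through them.
        # Minimizing this loss ascends the return-weighted log-likelihood.
        loss = tf.reduce_mean(neg_logp * returns)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss


# Random placeholder batch standing in for sampled trajectories.
rng = np.random.default_rng(0)
states = tf.constant(rng.normal(size=(8, OBS_DIM)), dtype=tf.float32)
actions = tf.constant(rng.integers(0, N_ACTIONS, size=8), dtype=tf.int32)
returns = tf.constant(rng.normal(size=8), dtype=tf.float32)

loss = reinforce_step(states, actions, returns)
print(float(loss))
```

In an actual training loop the batch would come from rollouts of the current policy, and the returns would typically be reward-to-go values with a baseline subtracted to reduce variance.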
