Lecture 7. Off-policy Policy Gradient
Sookyung Kim
1
Taxonomy of RL algorithms
2
Off-policy vs On-policy
Policy Gradient
Q-learning
3
Off-policy vs On-policy
4
Policy Gradient is On-policy
5
Off-policy learning & importance sampling
q(x)
6
Off-policy learning & importance sampling
q(x)
Come from the environment
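Importance sampling estimates an expectation under one distribution p(x) from samples drawn under another distribution q(x), reweighting each sample by p(x)/q(x). The sketch below is a minimal illustration using only the Python standard library. The two normal distributions, the function f(x) = x^2, and all helper names are assumptions chosen for the example, not taken from the slides.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[f(x)] using n samples drawn from q.

    Each sample is reweighted by p(x) / q(x), so q must be nonzero
    wherever p(x) * f(x) is nonzero.
    """
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        total += (p_pdf(x) / q_pdf(x)) * f(x)
    return total / n


rng = random.Random(0)
# Illustrative choice: target p = N(1, 1), sampling distribution q = N(0, 2^2).
estimate = importance_sampling_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda r: r.gauss(0.0, 2.0),
    n=100_000,
    rng=rng,
)
# The exact value is E_p[x^2] = mean^2 + variance = 2; the estimate should be close.
print(estimate)
```

The sampling distribution here is deliberately wider than the target, which keeps the weights p(x)/q(x) bounded and the variance of the estimate moderate. A q that is much narrower than p would give a far noisier estimate.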
7
Deriving policy gradient with importance sampling
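A sketch of the standard derivation, in commonly used notation that is assumed here rather than recovered from the slide: \pi_\theta is the policy that generated the samples, \pi_{\theta'} is the policy being improved, and r(\tau) is the total reward of trajectory \tau. The initial-state and transition probabilities belong to the environment, so they appear in both numerator and denominator of the trajectory ratio and cancel.

```latex
\begin{align}
J(\theta') &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
  \left[ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau) \right] \\
\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}
  &= \frac{p(s_1) \prod_{t=1}^{T} \pi_{\theta'}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
          {p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
   = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} \\
\nabla_{\theta'} J(\theta') &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
  \left[ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\,
         \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau) \right]
\end{align}
```

The last line uses the identity \nabla_{\theta'} \pi_{\theta'}(\tau) = \pi_{\theta'}(\tau) \nabla_{\theta'} \log \pi_{\theta'}(\tau). When \theta' = \theta the ratio equals 1 and the expression reduces to the ordinary on-policy policy gradient.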
8
The off-policy policy gradient
9
The off-policy policy gradient
<1
Sometimes works in practice, when the importance-sampling distribution has the same state distribution as the policy.
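The trajectory weight in the off-policy gradient is a product of T per-step probability ratios. If those ratios are typically below 1 (one reading of the "<1" annotation, assumed here), the product shrinks exponentially with T. The sketch below computes the running product for a single trajectory. The function name and the probabilities are hypothetical and serve only to show the effect.

```python
import math


def cumulative_importance_weights(target_probs, behavior_probs):
    """Return w_t = prod_{t' <= t} target_probs[t'] / behavior_probs[t'].

    Both arguments list the probability that each policy assigns to the
    action actually taken at each time step of one trajectory. Log-ratios
    are summed to avoid underflow on long trajectories.
    """
    weights = []
    log_w = 0.0
    for p_new, p_old in zip(target_probs, behavior_probs):
        log_w += math.log(p_new) - math.log(p_old)
        weights.append(math.exp(log_w))
    return weights


# Hypothetical trajectory of 50 steps in which every per-step ratio is 0.9.
T = 50
weights = cumulative_importance_weights([0.45] * T, [0.5] * T)
# The first weight is 0.9; the last is 0.9 ** 50, which is below 0.01.
print(weights[0], weights[-1])
```

Even a modest per-step mismatch between the two policies leaves almost no weight on long trajectories, which is why the full product is usually replaced by an approximation in practice.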
10
Implementing policy gradient: TensorFlow
11
Implementing policy gradient: automatic differentiation
12
Implementing policy gradient: automatic differentiation
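The usual trick for computing the policy gradient with automatic differentiation is to build a "pseudo-loss", a negative log-likelihood weighted by returns, whose gradient equals the negative of the policy gradient estimate. The sketch below checks that identity without any deep learning library. It assumes a stateless softmax policy over three actions, uses made-up actions and returns, and substitutes central finite differences for an autodiff framework. All names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pseudo_loss(logits, actions, returns):
    """Weighted negative log-likelihood: -(1/N) * sum_i log pi(a_i) * R_i."""
    probs = softmax(logits)
    n = len(actions)
    return -sum(math.log(probs[a]) * r for a, r in zip(actions, returns)) / n


def policy_gradient(logits, actions, returns):
    """Analytic estimate (1/N) * sum_i grad log pi(a_i) * R_i."""
    probs = softmax(logits)
    n = len(actions)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, returns):
        for k in range(len(logits)):
            # For a softmax policy: d log pi(a) / d logit_k = 1[k == a] - pi(k)
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * r / n
    return grad


def numerical_gradient(fn, x, eps=1e-6):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for k in range(len(x)):
        hi = x[:k] + [x[k] + eps] + x[k + 1:]
        lo = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad


# Hypothetical policy parameters and sampled data.
logits = [0.2, -0.1, 0.5]
actions = [0, 2, 2, 1]
returns = [1.0, 0.5, 2.0, -1.0]

pg = policy_gradient(logits, actions, returns)
loss_grad = numerical_gradient(lambda z: pseudo_loss(z, actions, returns), logits)
# Minimizing the pseudo-loss follows the policy gradient: loss_grad is -pg.
print(pg, loss_grad)
```

In an autodiff framework the same pseudo-loss would be written once and differentiated automatically. The returns must be treated as constants so that no gradient flows through them.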
13
Implementing policy gradient: In practice
REINFORCE using TF:
14