1 of 14

Lecture 7. Off-policy Policy Gradient

Sookyung Kim

2 of 14

Taxonomy of RL algorithms

3 of 14

Off-policy vs On-policy

Policy Gradient

Q-learning

4 of 14

Off-policy vs On-policy

5 of 14

Policy Gradient is On-policy

6 of 14

Off-policy learning & importance sampling

q(x)

7 of 14

Off-policy learning & importance sampling

q(x)

Come from the environment
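
As a rough illustration of importance sampling itself, the sketch below estimates an expectation under a target distribution p(x) using only samples drawn from a different distribution q(x), reweighting each sample by p(x)/q(x). The Gaussian choices for p and q, the test function f(x) = x^2, and all function names are assumptions made for this example; they are not taken from the slides.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[f(x)] as the average of (p(x)/q(x)) * f(x) over x ~ q."""
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)                       # sample from q, not from p
        total += (p_pdf(x) / q_pdf(x)) * f(x)    # reweight by p(x)/q(x)
    return total / n


rng = random.Random(0)
# Assumed toy setup: target p = N(1, 1), sampling distribution q = N(0, 2^2).
estimate = importance_sampling_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda r: r.gauss(0.0, 2.0),
    n=200_000,
    rng=rng,
)
# The exact value is E_p[x^2] = variance + mean^2 = 2.
print(estimate)
```

Here q is deliberately wider than p, so the weights p(x)/q(x) stay bounded; a q that rarely covers the regions where p is large would make the same estimator much noisier.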

8 of 14

Deriving policy gradient with importance sampling
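
For reference, one standard way to write this derivation is sketched below. The notation is an assumption chosen to match the speaker notes later in the deck: pi_theta is the policy that generated the samples, pi_theta' is the policy being improved, and r(tau) is the total reward of trajectory tau. The initial-state and transition probabilities appear in both numerator and denominator of the trajectory ratio, so they cancel.

```latex
\begin{align*}
J(\theta') &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\!\left[ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau) \right], \\
\pi_\theta(\tau) &= p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \\
\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} &= \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}, \\
\nabla_{\theta'} J(\theta') &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\!\left[ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau) \right].
\end{align*}
```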

9 of 14

The off-policy policy gradient

10 of 14

The off-policy policy gradient

<1

Sometimes works in practice, when the policy used for importance sampling has the same state distribution as the policy being learned.

Important! If you estimate the gradient with importance sampling,

then your samples come from pi_theta rather than pi_theta'.

So, on the sampled actions, pi_theta' is typically smaller than pi_theta.

So each sampling weight (red box) is typically less than 1 (between 0 and 1).

Multiplying many of them together drives the middle term toward 0 as the trajectory gets longer. (→ log of zero → exploding variance.)

Q': reward to go!
- We sample s_t, a_t by rolling out the policy in the environment.
- Sampling (s, a) pairs from the policy can be thought of as sampling from the state-action marginal at each time step t.
- This is because, when you sample an entire trajectory, the state and action at every time step are indistinguishable from what you would have gotten by sampling from the state-action marginal at that time step. (pi(a|s) → pi(a, s))
- We can split the marginal pi(s, a) into the product of pi(s) and pi(a|s).

Ignore the state marginal probability of the policy…
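
The shrinking product of per-step weights can be seen in a small simulation. The sketch below is an assumed toy setup, not the lecture's environment: a single state with two actions, a sampling policy pi_theta = [0.9, 0.1], and a new policy pi_theta' = [0.5, 0.5]. It reports the median trajectory weight for several horizons; all names and numbers are hypothetical.

```python
import random
import statistics


def trajectory_weight(pi_old, pi_new, horizon, rng):
    """Product over time steps of pi_new(a_t) / pi_old(a_t), with a_t ~ pi_old."""
    actions = list(range(len(pi_old)))
    weight = 1.0
    for _ in range(horizon):
        a = rng.choices(actions, weights=pi_old)[0]  # action from the sampling policy
        weight *= pi_new[a] / pi_old[a]              # per-step importance weight
    return weight


def median_weight(pi_old, pi_new, horizon, n_traj=1000, seed=0):
    """Median trajectory weight over n_traj sampled trajectories."""
    rng = random.Random(seed)
    return statistics.median(
        trajectory_weight(pi_old, pi_new, horizon, rng) for _ in range(n_traj)
    )


pi_theta = [0.9, 0.1]        # assumed sampling policy
pi_theta_prime = [0.5, 0.5]  # assumed policy being evaluated
medians = {T: median_weight(pi_theta, pi_theta_prime, T) for T in (1, 5, 50)}
for T, m in medians.items():
    print(f"horizon {T:3d}: median trajectory weight = {m:.3e}")
```

The expected per-step weight under pi_theta is exactly 1, so the collapse of the typical weight means that a few rare trajectories carry very large weights. That imbalance is where the high variance comes from.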

11 of 14

Implementing policy gradient: TensorFlow

12 of 14

Implementing policy gradient: automatic differentiation

13 of 14

Implementing policy gradient: automatic differentiation
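
A common way to get the policy gradient out of an automatic differentiation framework is to build a surrogate loss, the negative return-weighted log-likelihood of the sampled actions, and let the framework differentiate it. The framework-free sketch below checks that idea numerically under assumed conditions: a softmax policy over three actions in a single state, made-up sampled actions and returns, and finite differences standing in for automatic differentiation. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def surrogate_loss(logits, actions, returns):
    """Negative return-weighted log-likelihood, averaged over samples."""
    log_probs = [math.log(p) for p in softmax(logits)]
    return -sum(log_probs[a] * r for a, r in zip(actions, returns)) / len(actions)


def policy_gradient(logits, actions, returns):
    """Analytic estimate: mean of grad log pi(a) * return for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, returns):
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            grad[k] += (indicator - probs[k]) * r  # d log pi(a) / d logit_k = 1[k=a] - pi(k)
    return [g / len(actions) for g in grad]


def numerical_gradient(fn, x, eps=1e-6):
    """Central finite differences, used here in place of automatic differentiation."""
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += eps
        down[k] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad


logits = [0.2, -0.1, 0.5]        # assumed policy parameters
actions = [0, 2, 2, 1]           # assumed sampled actions
returns = [1.0, 0.5, 2.0, -1.0]  # assumed returns for those samples

pg = policy_gradient(logits, actions, returns)
loss_grad = numerical_gradient(lambda th: surrogate_loss(th, actions, returns), logits)
print(pg)
print(loss_grad)
```

The two printed vectors should agree up to sign: minimizing the surrogate loss by gradient descent is the same as taking a gradient ascent step along the policy gradient estimate.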

14 of 14

Implementing policy gradient: In practice

https://github.com/fastscience-ai/RL_toturial_AIAI2022/blob/main/policy_gradient.ipynb

REINFORCE using TF:
