All right, now the causality trick that I described before, you can always use it; it reduces your variance. There's another, slightly more involved trick that we can use that also turns out to be very important for making policy gradients practical, and it's something called a baseline.
So let's think back to this cartoon that we had, where we collect some trajectories, we evaluate the rewards, and then we try to make the good ones more likely and the bad ones less likely. It's a very straightforward, elegant way to formalize trial-and-error learning as a gradient ascent procedure. But is this actually what policy gradients do?
Well, intuitively, policy gradients will do this if the rewards are centered, meaning that the good trajectories have positive rewards and the bad trajectories have negative rewards. But this might not necessarily be true. What if all of your rewards are positive? Then the probability of the green check mark will be increased, the probability of the yellow check mark will be increased a little bit, and the probability of the red X will also be increased, but a tiny bit.
So intuitively it kind of seems like what we want to do is center our rewards, so that the things that are better than average get increased and the things that are worse than average get decreased.
For example, maybe we want to subtract a quantity from our reward, namely the average reward: instead of multiplying grad log p by r of tau, we multiply by r of tau minus b, where b is the average reward.
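In symbols, the estimator being described is the usual sample-based policy gradient with the rewards shifted by b:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\big[ r(\tau_i) - b \big],
\qquad b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)
```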
This would cause policy gradients to align with our intuition: it would make policy gradients increase the probability of trajectories that are better than average and decrease the probability of trajectories that are worse than average. And this would be true regardless of what the reward function actually is, even if the rewards are always positive.
That seems very intuitive, but are we allowed to do that? It seems like we just arbitrarily subtract a constant from all of our rewards; is this even correct still?
Well, it turns out that you can show that subtracting a constant b from your rewards in the policy gradient will not actually change the gradient in expectation, although it will change its variance, meaning that for any b, doing this trick will keep your gradient estimator unbiased.
Here's how we can derive this. We're going to use the same convenient identity from before, which is that p of tau times grad log p of tau is equal to grad p of tau, and now we're going to substitute this identity in the opposite direction.
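As a reminder, that identity is just the chain rule applied to the log of the trajectory distribution:

```latex
p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)
= p_\theta(\tau)\,\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}
= \nabla_\theta p_\theta(\tau)
```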
So what we're going to do is analyze grad log p of tau times b. If I take the difference r of tau minus b and I distribute grad log p into it, then I get a grad log p times r term, which is my original policy gradient, minus a grad log p times b term, which is the new term that I'm adding. So let's analyze just that term: the expected value of grad log p times b, which is the integral of p of tau times grad log p of tau times b. And now I'm going to substitute my identity back in: using the convenient identity in the blue box over there, I know this is equal to the integral of grad p of tau times b.
Now, by linearity of the gradient operator, I can take both the gradient operator and b outside the integral, so this is equal to b times the gradient of the integral over tau of p of tau. But p of tau is a probability distribution, and we know that probability distributions integrate to one, which means that this is equal to b times the gradient with respect to theta of one. But the gradient with respect to theta of one is zero, because one doesn't depend on theta. Therefore we know that this expected value comes out equal to zero in expectation.
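Putting those steps together in one line:

```latex
E_{\tau \sim p_\theta(\tau)}\!\big[ \nabla_\theta \log p_\theta(\tau)\, b \big]
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \, d\tau
= \int \nabla_\theta p_\theta(\tau)\, b \, d\tau
= b\, \nabla_\theta \int p_\theta(\tau)\, d\tau
= b\, \nabla_\theta 1
= 0
```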
But for a finite number of samples, it's not equal to zero. So what this means is that subtracting b will keep our policy gradient unbiased, but it will actually alter its variance. So subtracting a baseline is unbiased in expectation.
The average reward, which is what I'm using here, turns out not to be the best baseline, but it's actually pretty good, and in many cases, when we just need a quick and dirty baseline, we'll use the average reward.
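As a minimal sketch of that quick-and-dirty version (not the lecture's own code), assuming we already have per-trajectory total rewards and the summed log-probabilities of the actions under the current policy, with hypothetical names traj_log_probs and traj_rewards, a REINFORCE-style surrogate loss with an average-reward baseline might look like this in PyTorch:

```python
import torch

def policy_gradient_loss(traj_log_probs: torch.Tensor,
                         traj_rewards: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the baseline-subtracted policy gradient.

    traj_log_probs: shape [N], sum_t log pi_theta(a_t | s_t) for each sampled
                    trajectory (must carry gradients with respect to theta).
    traj_rewards:   shape [N], total reward r(tau_i) of each sampled trajectory.
    """
    # Quick-and-dirty baseline: the average reward over the sampled batch.
    baseline = traj_rewards.mean()
    # Center the rewards and treat them as constants (no gradient through them).
    centered = (traj_rewards - baseline).detach()
    # Negative sign because optimizers minimize, while we want gradient ascent.
    return -(traj_log_probs * centered).mean()
```

Calling backward() on this loss and taking an optimizer step performs gradient ascent on the baseline-subtracted estimator; in practice you would combine this with the causality trick from before rather than using whole-trajectory rewards.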
However, we can actually derive the optimal baseline. The optimal baseline is not used very much in practical policy gradient algorithms, but it's perhaps instructive to derive it, just to understand some of the mathematical tools that go into studying variance.