1 of 13

STATS / DATA SCI 315

Lecture 06

Softmax Derivatives

Information theory basics

2 of 13

Softmax and Derivatives
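As a quick reference for what follows, the softmax map turns the logits o = (o_1, ..., o_q) into predicted probabilities:

    \hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}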

3 of 13

Recall squared loss case
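A sketch of the recap, assuming the usual regression setup with prediction \hat{y} and the 1/2 factor: under squared loss the gradient with respect to the prediction is simply prediction minus observation,

    l(y, \hat{y}) = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial l}{\partial \hat{y}} = \hat{y} - y

which is the pattern the softmax case will mirror.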

4 of 13

Applying chain rule
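A sketch of the chain-rule step, assuming the loss depends on the logits only through \hat{\mathbf{y}} = softmax(o):

    \frac{\partial l}{\partial o_j} = \sum_i \frac{\partial l}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial o_j}

(each logit o_j influences every \hat{y}_i through the normalizing sum, so the sum runs over all output coordinates).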

5 of 13

Cross-entropy in terms of the o’s
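Writing the loss directly in terms of the logits, using \hat{\mathbf{y}} = softmax(o) and \sum_j y_j = 1:

    l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j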

6 of 13

Gradient of loss w.r.t. o
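Differentiating the expression above with respect to o_j gives predicted probability minus observed label:

    \frac{\partial l}{\partial o_j} = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j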

7 of 13

Gradient of loss w.r.t. weights

  • Note that the weights are in a q × d matrix W
  • Let w_j^T be the j-th row of W
  • Since o = W x, we have o_j = w_j^T x, so only w_j affects o_j (see the sketch below)
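A sketch combining the two pieces above: since o_j = w_j^T x, the chain rule gives

    \frac{\partial l}{\partial \mathbf{w}_j} = \left(\mathrm{softmax}(\mathbf{o})_j - y_j\right) \mathbf{x}

i.e., each row of W gets a gradient equal to the per-class error times the input x.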

8 of 13

Cross-entropy also works with soft observed labels

  • So far we’ve assumed hard (one-hot) labels in the data but soft labels for our model output
  • However, the observed labels themselves can be soft
  • Cross-entropy loss continues to make sense with soft observed labels too (see the formula below)
  • Interpretation: expected loss under hard labels sampled from the observed soft label
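For a soft label vector y (nonnegative entries summing to 1), the loss is unchanged,

    l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j

which is the expected value of -\log \hat{y}_J when the class J is sampled from the distribution y.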

9 of 13

Information Theory Basics

10 of 13

Entropy

  • Suppose I need to encode an “alphabet” where symbol j occurs with prob P(j)
  • The encoding needs to be in binary (0s and 1s)
  • Intuitively, a good encoding will assign shorter codes to frequent symbols
  • It turns out the optimal encoding needs this many bits per symbol on average (see below):
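That optimal average code length is the entropy of P, measured in bits:

    H(P) = \sum_j P(j) \log_2 \frac{1}{P(j)} = -\sum_j P(j) \log_2 P(j)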

11 of 13

Entropy

  • If we use “nats” instead of bits, we get entropy expressed in natural logs
  • 1 nat is approx. 1.44 bits
  • The optimal encoding of symbol j will use approx. log(1/P(j)) nats (see below)
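In nats the same quantity reads

    H(P) = -\sum_j P(j) \ln P(j), \qquad 1 \text{ nat} = \log_2 e \approx 1.44 \text{ bits}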

12 of 13

Cross-entropy

  • What if the true distribution is P but we think it is Q?
  • We will assign an encoding of length -log Q(j) to symbol j
  • So our expected code length will be (see below):
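Averaging the code length -\log Q(j) over symbols drawn from the true distribution P gives the cross-entropy:

    H(P, Q) = -\sum_j P(j) \log Q(j)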

13 of 13

KL divergence or relative entropy

  • Note that if Q is not the same as P, we expect some overhead
  • That is, H(P, Q) > H(P) whenever Q ≠ P (and H(P, Q) = H(P) when Q = P)
  • KL divergence, aka relative entropy, measures this excess (see the formula below)
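The excess is exactly the KL divergence:

    D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_j P(j) \log \frac{P(j)}{Q(j)} \ge 0

with equality if and only if P = Q.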