1 of 13

STATS / DATA SCI 315

Lecture 06

Softmax Derivatives

Information theory basics

2 of 13

Softmax and Derivatives
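As a quick reference for what follows, the softmax map turns the logits o = (o_1, ..., o_q) into predicted probabilities:

    \hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}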

3 of 13

Recall squared loss case
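A sketch of the recap, assuming the usual regression setup with prediction \hat{y} and the 1/2 factor: under squared loss the gradient with respect to the prediction is simply prediction minus observation,

    l(y, \hat{y}) = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial l}{\partial \hat{y}} = \hat{y} - y

which is the pattern the softmax case will mirror.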

4 of 13

Applying chain rule
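A sketch of the chain-rule step, assuming the loss depends on the logits only through \hat{\mathbf{y}} = softmax(o):

    \frac{\partial l}{\partial o_j} = \sum_i \frac{\partial l}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial o_j}

(each logit o_j influences every \hat{y}_i through the normalizing sum, so the sum runs over all output coordinates).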

5 of 13

Cross-entropy in terms of the o’s
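Writing the loss directly in terms of the logits, using \hat{\mathbf{y}} = softmax(o) and \sum_j y_j = 1:

    l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j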

6 of 13

Gradient of loss w.r.t. o
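Differentiating the expression above with respect to o_j gives predicted probability minus observed label:

    \frac{\partial l}{\partial o_j} = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j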

7 of 13

Gradient of loss w.r.t. weights

  • Note that the weights are in a q × d matrix W
  • Let w_j^T be the j-th row of W
  • Since o = W x, we have o_j = w_j^T x, so only w_j affects o_j (see the sketch below)
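A sketch combining the two pieces above: since o_j = w_j^T x, the chain rule gives

    \frac{\partial l}{\partial \mathbf{w}_j} = \left(\mathrm{softmax}(\mathbf{o})_j - y_j\right) \mathbf{x}

i.e., each row of W gets a gradient equal to the per-class error times the input x.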

8 of 13

Cross-entropy also works with soft observed labels

  • So far we’ve assumed hard (one-hot) labels in the data but soft labels for our model output
  • However, the observed labels themselves can be soft
  • Cross-entropy loss continues to make sense with soft observed labels too (see the formula below)
  • Interpretation: expected loss under hard labels sampled from the observed soft label
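For a soft label vector y (nonnegative entries summing to 1), the loss is unchanged,

    l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j

which is the expected value of -\log \hat{y}_J when the class J is sampled from the distribution y.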

9 of 13

Information Theory Basics

10 of 13

Entropy

  • Suppose I need to encode an “alphabet” where symbol j occurs with prob P(j)
  • The encoding needs to be in binary (0s and 1s)
  • Intuitively, a good encoding will assign shorter codes to frequent symbols
  • It turns out the optimal encoding needs this many bits per symbol on average (see below):
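That optimal average code length is the entropy of P, measured in bits:

    H(P) = \sum_j P(j) \log_2 \frac{1}{P(j)} = -\sum_j P(j) \log_2 P(j)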

11 of 13

Entropy

  • If we use “nats” instead of bits, we get entropy expressed in natural logs
  • 1 nat is approx. 1.44 bits
  • The optimal encoding of symbol j will use approx. log(1/P(j)) nats (see below)
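In nats the same quantity reads

    H(P) = -\sum_j P(j) \ln P(j), \qquad 1 \text{ nat} = \log_2 e \approx 1.44 \text{ bits}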

12 of 13

Cross-entropy

  • What if the true distribution is P but we think it is Q?
  • We will assign an encoding of length -log Q(j) to symbol j
  • So our expected code length will be (see below):
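Averaging the code length -\log Q(j) over symbols drawn from the true distribution P gives the cross-entropy:

    H(P, Q) = -\sum_j P(j) \log Q(j)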

13 of 13

KL divergence or relative entropy

  • Note that if Q is not the same as P, we expect some overhead
  • That is, H(P, Q) > H(P) whenever Q ≠ P (and H(P, Q) = H(P) when Q = P)
  • KL divergence, aka relative entropy, measures this excess (see the formula below)
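The excess is exactly the KL divergence:

    D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_j P(j) \log \frac{P(j)}{Q(j)} \ge 0

with equality if and only if P = Q.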