1 of 20

STATS / DATA SCI 315

Lecture 05

Classification

Softmax

Cross-entropy loss

2 of 20

Classification Problems

3 of 20

Regression vs Classification

  • Regression answers “how much?” or “how many?” questions
    • Dollar price of a home
    • # of wins of a baseball team
    • Hospitalization duration in days
  • Classification answers “which one?” questions
    • Is this email spam or legit?
    • Will a customer sign up for a new service?
    • Which movie is a customer going to watch next (see extreme classification)?

4 of 20

Toy problem

  • Imagine classifying 2 × 2 grayscale images into one of 3 categories:
    • “cat” , “chicken”, “dog”
  • Input image consists of just 4 features 𝑥1,𝑥2,𝑥3,𝑥4
  • How do we represent the label y?
  • We could use y ∈ {1, 2, 3}, but this suggests an ordering among the labels
  • With one-hot encoding, 𝑦 would be a three-dimensional vector, with (1, 0, 0) corresponding to “cat”, (0, 1, 0) to “chicken”, and (0, 0, 1) to “dog” (see the sketch below)
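A minimal sketch of this one-hot encoding in PyTorch (the class-to-index assignment 0 = “cat”, 1 = “chicken”, 2 = “dog” is an assumption made here for illustration):

```python
import torch
import torch.nn.functional as F

# Assumed index order for illustration: 0 = "cat", 1 = "chicken", 2 = "dog"
labels = torch.tensor([0, 2, 1])            # three example labels
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],    <- "cat"
#         [0, 0, 1],    <- "dog"
#         [0, 1, 0]])   <- "chicken"
```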

5 of 20

What does our model output?

  • We might want to output hard labels, i.e., a definite assignment to one of the 3 classes
  • Or we might prefer soft labels, i.e., probabilities
  • E.g., (0.5, 0, 0.5) corresponds to a 50-50 chance of cat or dog
  • Need a model with multiple outputs, one per class
  • If we restrict ourselves to linear (actually affine) models, then we need 3 affine functions
  • Each affine function has 4 weights and 1 bias (written out below)
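Written out explicitly, with wkj denoting the weight that connects input feature xj to output ok (this indexing convention is an assumption, but it matches the usual D2L presentation):

o1 = w11 x1 + w12 x2 + w13 x3 + w14 x4 + b1
o2 = w21 x1 + w22 x2 + w23 x3 + w24 x4 + b2
o3 = w31 x1 + w32 x2 + w33 x3 + w34 x4 + b3

In total: 3 × 4 = 12 weights and 3 biases.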

6 of 20

Network Architecture

7 of 20

8 of 20

Compact matrix notation

  • We gather all of our weights into a 3×4 matrix W
  • For the features of a given data example 𝐱, our outputs are given by a matrix-vector product of the weights with the features, plus the biases 𝐛:
  • 𝐨 = 𝐖𝐱+𝐛
  • Left-hand side: the output 𝐨 is 3 × 1
  • Right-hand side:
    • Matrix-vector product 𝐖𝐱: (3 × 4)(4 × 1) = 3 × 1
    • Bias 𝐛: 3 × 1
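A minimal NumPy sketch of these shapes (the random values are placeholders, not part of the lecture):

```python
import numpy as np

W = np.random.randn(3, 4)   # one row of weights per class
b = np.random.randn(3)      # one bias per class
x = np.random.randn(4)      # the 4 pixel features of a single 2x2 image

o = W @ x + b               # (3 x 4)(4 x 1) + (3 x 1)
print(o.shape)              # (3,) -- one raw output per class
```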

9 of 20

Parameterization Cost

10 of 20

Parameters in a fully connected layer

  • Fully-connected layers are ubiquitous in deep learning
  • Fully-connected layers have many learnable parameters
  • For a fully-connected layer with 𝑑 inputs and 𝑞 outputs, the parameterization cost is O(𝑑𝑞): 𝑑𝑞 weights plus 𝑞 biases (see the check below)
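A quick check of this count using a PyTorch fully-connected layer, with the toy problem’s d = 4 and q = 3:

```python
import torch.nn as nn

d, q = 4, 3                      # inputs and outputs from the toy problem
layer = nn.Linear(d, q)          # a fully-connected (affine) layer

print(layer.weight.shape)        # torch.Size([3, 4])  -> d*q weights
print(layer.bias.shape)          # torch.Size([3])     -> q biases
print(sum(p.numel() for p in layer.parameters()))   # 15 = d*q + q, i.e. O(dq)
```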

11 of 20

Softmax Operation

12 of 20

Why softmax?

  • Recall our linear model; can we use its outputs directly as probabilities?
  • The outputs are not necessarily positive!
  • The outputs don’t sum to 1!
  • This violates basic laws of probability

13 of 20

Softmax function

  • We exponentiate our outputs (to get positive values)
  • Then divide by their sum (to get them to sum to 1)
  • In symbols: ŷ_j = exp(o_j) / Σ_k exp(o_k)  (see the sketch below)
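A minimal sketch of the softmax function in NumPy (subtracting the max for numerical stability is an implementation detail added here, not something the slide requires):

```python
import numpy as np

def softmax(o):
    # Subtracting the max does not change the result (softmax is invariant
    # to adding a constant to every output) but avoids overflow in exp.
    z = np.exp(o - o.max())
    return z / z.sum()

o = np.array([2.0, -1.0, 0.5])   # raw outputs: one is negative, sum != 1
y_hat = softmax(o)
print(y_hat)                     # approx [0.79, 0.04, 0.18]
print(y_hat.sum())               # 1.0
```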

14 of 20

Properties of softmax function

  • It produces valid probabilities
  • It preserves the ordering of the outputs (see the check below)
  • Our model has now become: softmax(W x + b)
  • D2L book says this is still a linear model
  • I think a more appropriate description is a generalized linear model (you’re allowed to add one nonlinear function on top of a linear model)
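A quick check of the ordering property, reusing the example values from the softmax sketch above (the numbers are made up):

```python
import numpy as np

o = np.array([2.0, -1.0, 0.5])                  # raw outputs
y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()

# The most likely class is the same whether we rank raw outputs or probabilities.
assert np.argmax(o) == np.argmax(y_hat)
```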

15 of 20

Softmax output as conditional probabilities

  • On an input 𝐱, the softmax function gives us a vector ŷ = softmax(𝐨) = softmax(𝐖𝐱 + 𝐛)
  • Each component is ŷ_j = exp(o_j) / Σ_k exp(o_k)
  • We can interpret ŷ_j as the estimated conditional probability of class j given the input 𝐱, i.e., ŷ_j = P(y = j ∣ 𝐱)

16 of 20

Likelihood

17 of 20

Likelihood

  • Suppose that the entire dataset {𝐗,𝐘} has 𝑛 examples
  • Example 𝑖 consists of a feature vector 𝐱(𝑖) and a one-hot label vector 𝐲(𝑖)
  • We want the probability of the actual observed class labels given the features, P(𝐘 ∣ 𝐗)  (see below)
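Assuming the examples are independent (the standard assumption in this derivation), the likelihood factorizes over the n examples:

P(𝐘 ∣ 𝐗) = P(𝐲(1) ∣ 𝐱(1)) × P(𝐲(2) ∣ 𝐱(2)) × ⋯ × P(𝐲(n) ∣ 𝐱(n))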

18 of 20

Likelihood

  • Note that P(𝐲(i) ∣ 𝐱(i)) can be written in terms of the predicted probability vector ŷ(i)
  • It is simply the product over the classes j:  P(𝐲(i) ∣ 𝐱(i)) = ∏_j (ŷ_j(i)) ^ (y_j(i))

  • This is a complicated way of saying that the probability of seeing an observed label is just the component of your predicted probability vector corresponding to that label: every other factor has exponent 0 and therefore equals 1 (see the numeric example below)
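A tiny numeric illustration (the probability values are made up):

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities for cat/chicken/dog
y     = np.array([0,   1,   0])     # one-hot label: the example is a "chicken"

p = np.prod(y_hat ** y)             # 0.1**0 * 0.7**1 * 0.2**0
print(p)                            # 0.7 -- just the "chicken" component of y_hat
```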

19 of 20

Log Likelihood
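Taking the negative log turns the product into a sum; the standard steps (as in the D2L book) are:

−log P(𝐘 ∣ 𝐗) = Σ_i −log P(𝐲(i) ∣ 𝐱(i)) = Σ_i l(𝐲(i), ŷ(i))

where the loss for a single example is the cross-entropy

l(𝐲, ŷ) = − Σ_j y_j log ŷ_j

Maximizing the likelihood is therefore equivalent to minimizing this sum of per-example losses.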

20 of 20

Cross-entropy loss

  • Because of the one-hot encoding of 𝐲, only one term in the sum survives: l(𝐲, ŷ) = −log ŷ_c, where c is the true class (see the sketch below)
  • You pay a high loss if an improbable label (according to your model) is seen
  • Loss is always non-negative
  • When is it (close to) zero?
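A minimal sketch of the cross-entropy loss for a single example (the probability values are made up; in practice a library routine such as PyTorch's nn.CrossEntropyLoss is used, which takes the raw outputs o rather than probabilities):

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])     # predicted probabilities
y     = np.array([0,   1,   0])       # one-hot label ("chicken")

loss = -np.sum(y * np.log(y_hat))     # only the true-class term survives
print(loss)                           # -log(0.7) ≈ 0.357

# Probabilities are at most 1, so -log(y_hat) >= 0 and the loss is non-negative.
```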