1 of 20

STATS / DATA SCI 315

Lecture 05

Classification

Softmax

Cross-entropy loss

2 of 20

Classification Problems

3 of 20

Regression vs Classification

  • Regression answers “how much?” or “how many?” questions
    • Dollar price of a home
    • # of wins of a baseball team
    • Hospitalization duration in days
  • Classification answers “which one?” questions
    • Is this email spam or legit?
    • Will a customer sign up for a new service?
    • Which movie is a customer going to watch next (see extreme classification)?

4 of 20

Toy problem

  • Imagine classifying 2 × 2 grayscale images into one of 3 categories:
    • “cat” , “chicken”, “dog”
  • Input image consists of just 4 features 𝑥1,𝑥2,𝑥3,𝑥4
  • How do we represent the label y?
  • We could use y ∈ {1, 2, 3}, but this suggests an ordering among the labels
  • With one-hot encoding, 𝑦 would be a three-dimensional vector, with (1, 0, 0) corresponding to “cat”, (0, 1, 0) to “chicken”, and (0, 0, 1) to “dog” (see the sketch below)
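A minimal sketch of this one-hot encoding in PyTorch (the class-to-index assignment 0 = “cat”, 1 = “chicken”, 2 = “dog” is an assumption made here for illustration):

```python
import torch
import torch.nn.functional as F

# Assumed index order for illustration: 0 = "cat", 1 = "chicken", 2 = "dog"
labels = torch.tensor([0, 2, 1])            # three example labels
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],    <- "cat"
#         [0, 0, 1],    <- "dog"
#         [0, 1, 0]])   <- "chicken"
```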

5 of 20

What does our model output?

  • We might want to output hard labels, i.e., a definite assignment to one of the 3 classes
  • Or we might prefer soft labels, i.e., probabilities
  • E.g., (0.5, 0, 0.5) corresponds to a 50-50 chance of cat or dog
  • Need a model with multiple outputs, one per class
  • If we restrict ourselves to linear (actually affine) models, then we need 3 affine functions
  • Each affine function has 4 weights and 1 bias (written out below)
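Written out explicitly, with wkj denoting the weight that connects input feature xj to output ok (this indexing convention is an assumption, but it matches the usual D2L presentation):

o1 = w11 x1 + w12 x2 + w13 x3 + w14 x4 + b1
o2 = w21 x1 + w22 x2 + w23 x3 + w24 x4 + b2
o3 = w31 x1 + w32 x2 + w33 x3 + w34 x4 + b3

In total: 3 × 4 = 12 weights and 3 biases.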

6 of 20

Network Architecture

7 of 20

8 of 20

Compact matrix notation

  • We gather all of our weights into a 3×4 matrix W
  • For the features of a given data example 𝐱, our outputs are given by a matrix-vector product of the weights with the features, plus the biases 𝐛:
  • 𝐨 = 𝐖𝐱+𝐛
  • Left-hand side: the output 𝐨 is 3 × 1
  • Right-hand side:
    • Matrix-vector product 𝐖𝐱: (3 × 4)(4 × 1) = 3 × 1
    • Bias 𝐛: 3 × 1
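A minimal NumPy sketch of these shapes (the random values are placeholders, not part of the lecture):

```python
import numpy as np

W = np.random.randn(3, 4)   # one row of weights per class
b = np.random.randn(3)      # one bias per class
x = np.random.randn(4)      # the 4 pixel features of a single 2x2 image

o = W @ x + b               # (3 x 4)(4 x 1) + (3 x 1)
print(o.shape)              # (3,) -- one raw output per class
```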

9 of 20

Parameterization Cost

10 of 20

Parameters in a fully connected layer

  • Fully-connected layers are ubiquitous in deep learning
  • Fully-connected layers have many learnable parameters
  • For a fully-connected layer with 𝑑 inputs and 𝑞 outputs, the parameterization cost is O(𝑑𝑞): 𝑑𝑞 weights plus 𝑞 biases (see the check below)
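A quick check of this count using a PyTorch fully-connected layer, with the toy problem’s d = 4 and q = 3:

```python
import torch.nn as nn

d, q = 4, 3                      # inputs and outputs from the toy problem
layer = nn.Linear(d, q)          # a fully-connected (affine) layer

print(layer.weight.shape)        # torch.Size([3, 4])  -> d*q weights
print(layer.bias.shape)          # torch.Size([3])     -> q biases
print(sum(p.numel() for p in layer.parameters()))   # 15 = d*q + q, i.e. O(dq)
```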

11 of 20

Softmax Operation

12 of 20

Why softmax?

  • Recall our linear model; can we use its outputs directly as probabilities?
  • The outputs are not necessarily positive!
  • The outputs don’t sum to 1!
  • This violates basic laws of probability

13 of 20

Softmax function

  • We exponentiate our outputs (to get positive values)
  • Then divide by their sum (to get them to sum to 1)
  • In symbols: ŷ_j = exp(o_j) / Σ_k exp(o_k)  (see the sketch below)
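A minimal sketch of the softmax function in NumPy (subtracting the max for numerical stability is an implementation detail added here, not something the slide requires):

```python
import numpy as np

def softmax(o):
    # Subtracting the max does not change the result (softmax is invariant
    # to adding a constant to every output) but avoids overflow in exp.
    z = np.exp(o - o.max())
    return z / z.sum()

o = np.array([2.0, -1.0, 0.5])   # raw outputs: one is negative, sum != 1
y_hat = softmax(o)
print(y_hat)                     # approx [0.79, 0.04, 0.18]
print(y_hat.sum())               # 1.0
```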

14 of 20

Properties of softmax function

  • It produces valid probabilities
  • It preserves the ordering of the outputs (see the check below)
  • Our model has now become: softmax(W x + b)
  • D2L book says this is still a linear model
  • I think a more appropriate description is a generalized linear model (you’re allowed to add one nonlinear function on top of a linear model)
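A quick check of the ordering property, reusing the example values from the softmax sketch above (the numbers are made up):

```python
import numpy as np

o = np.array([2.0, -1.0, 0.5])                  # raw outputs
y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()

# The most likely class is the same whether we rank raw outputs or probabilities.
assert np.argmax(o) == np.argmax(y_hat)
```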

15 of 20

Softmax output as conditional probabilities

  • On an input 𝐱, the softmax function gives us a vector ŷ = softmax(𝐨) = softmax(𝐖𝐱 + 𝐛)
  • Each component is ŷ_j = exp(o_j) / Σ_k exp(o_k)
  • We can interpret ŷ_j as the estimated conditional probability of class j given the input 𝐱, i.e., ŷ_j = P(y = j ∣ 𝐱)

16 of 20

Likelihood

17 of 20

Likelihood

  • Suppose that the entire dataset {𝐗,𝐘} has 𝑛 examples
  • Example 𝑖 consists of a feature vector 𝐱(𝑖) and a one-hot label vector 𝐲(𝑖)
  • We want the probability of the actual observed class labels given the features, P(𝐘 ∣ 𝐗)  (see below)
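Assuming the examples are independent (the standard assumption in this derivation), the likelihood factorizes over the n examples:

P(𝐘 ∣ 𝐗) = P(𝐲(1) ∣ 𝐱(1)) × P(𝐲(2) ∣ 𝐱(2)) × ⋯ × P(𝐲(n) ∣ 𝐱(n))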

18 of 20

Likelihood

  • Note that P(𝐲(i) ∣ 𝐱(i)) can be written in terms of the predicted probability vector ŷ(i)
  • It is simply the product over the classes j:  P(𝐲(i) ∣ 𝐱(i)) = ∏_j (ŷ_j(i)) ^ (y_j(i))

  • This is a complicated way of saying that the probability of seeing an observed label is just the component of your predicted probability vector corresponding to that label: every other factor has exponent 0 and therefore equals 1 (see the numeric example below)
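A tiny numeric illustration (the probability values are made up):

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities for cat/chicken/dog
y     = np.array([0,   1,   0])     # one-hot label: the example is a "chicken"

p = np.prod(y_hat ** y)             # 0.1**0 * 0.7**1 * 0.2**0
print(p)                            # 0.7 -- just the "chicken" component of y_hat
```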

19 of 20

Log Likelihood
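Taking the negative log turns the product into a sum; the standard steps (as in the D2L book) are:

−log P(𝐘 ∣ 𝐗) = Σ_i −log P(𝐲(i) ∣ 𝐱(i)) = Σ_i l(𝐲(i), ŷ(i))

where the loss for a single example is the cross-entropy

l(𝐲, ŷ) = − Σ_j y_j log ŷ_j

Maximizing the likelihood is therefore equivalent to minimizing this sum of per-example losses.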

20 of 20

Cross-entropy loss

  • Because of the one-hot encoding of 𝐲, only one term in the sum survives: l(𝐲, ŷ) = −log ŷ_c, where c is the true class (see the sketch below)
  • You pay a high loss if an improbable label (according to your model) is seen
  • Loss is always non-negative
  • When is it (close to) zero?
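A minimal sketch of the cross-entropy loss for a single example (the probability values are made up; in practice a library routine such as PyTorch's nn.CrossEntropyLoss is used, which takes the raw outputs o rather than probabilities):

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])     # predicted probabilities
y     = np.array([0,   1,   0])       # one-hot label ("chicken")

loss = -np.sum(y * np.log(y_hat))     # only the true-class term survives
print(loss)                           # -log(0.7) ≈ 0.357

# Probabilities are at most 1, so -log(y_hat) >= 0 and the loss is non-negative.
```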