1 of 16

Classification

Data 100

Slides by John DeNero

2 of 16

Announcements

3 of 16

Classifiers and Decisions

4 of 16

Classification Rules

A classifier is a function f(x) that outputs a prediction of y: 0 or 1.

Logistic regression finds a function that estimates P(Y=1|X).

Given a particular x to classify, the most common classification rule is:

predict ŷ = 1 if σ(xTβ) ≥ 0.5, and ŷ = 0 otherwise

Which is equivalent to:

predict ŷ = 1 if xTβ ≥ 0, and ŷ = 0 otherwise

5 of 16

Classification Rules

A classifier is a function f(x) that outputs a prediction of y: 0 or 1.

Logistic regression finds a function that estimates P(Y=1|X).

Given a particular x to classify, the most common classification rule is:

predict ŷ = 1 if σ(xTβ) ≥ 0.5 (equivalently, xTβ ≥ 0), and ŷ = 0 otherwise

The threshold used to distinguish among outputs can be adjusted if certain types of errors are more problematic than others.

Appropriate decision rules depend on the application context. When prediction accuracy is important in an uncertain domain, the best output might not be a 0 or 1 label at all; it may be better to skip the example (decline to classify it).
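
As a minimal sketch (not from the slides; the function names and the optional abstain margin are illustrative assumptions), a threshold-based decision rule might look like this in NumPy:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def classify(X, beta, threshold=0.5, abstain_margin=None):
    """Predict 0/1 from the estimated P(Y=1|X) = sigmoid(X @ beta).

    Raise the threshold if false positives are costlier; lower it if
    false negatives are. If abstain_margin is given, return -1 (skip)
    for probabilities within that distance of the threshold.
    """
    p = sigmoid(X @ beta)
    preds = (p >= threshold).astype(int)
    if abstain_margin is not None:
        preds = np.where(np.abs(p - threshold) < abstain_margin, -1, preds)
    return preds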

6 of 16

Evaluating Classifiers

The loss used for model fitting is typically not the best evaluation metric.

Accuracy: (TP + TN) / n (the most obvious/common evaluation metric)

Error rate: (FP + FN) / n

Precision: TP / (TP + FP) (used when detecting rare outcomes)

Recall: TP / (TP + FN) (used when detecting rare outcomes)

Confusion matrix:

                 Truth: 1              Truth: 0
Prediction: 1    True positive (TP)    False positive (FP)
Prediction: 0    False negative (FN)   True negative (TN)
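
As a small illustration (not from the slides; the function and variable names are assumptions), these four metrics can be computed from 0/1 labels and predictions with NumPy:

import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, error rate, precision, and recall from 0/1 arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }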

7 of 16

Linear Separability

For a set of observations (x₁, y₁), …, (xₙ, yₙ), is it possible to classify them all perfectly using a linear function of x?

Does there exist a β such that: xTβ < 0 for all and only the x whose y is 0, and

xTβ ≥ 0 for all and only the x whose y is 1?

If so, that data set is linearly separable and a classifier can have 100% training accuracy (and therefore 100% training precision and 100% training recall).
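
Following the definition above, a quick check of whether a particular β separates a labeled data set might look like this (a sketch, not from the slides):

import numpy as np

def separates(X, y, beta):
    """True if beta classifies every example correctly:
    X @ beta >= 0 exactly on the rows where y == 1."""
    return bool(np.all(((X @ beta) >= 0) == (y == 1)))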

Concerns:

  • For separable data, β may not be unique.
  • A linearly separable data set may be a sample from a population that isn’t.
  • If both the sample and population are separable, it still might be the case that the β chosen on the sample does not separate the population.
  • For non-separable data (most data), low accuracy doesn’t always mean that the classifier is suboptimal; the population might just be difficult to classify.

8 of 16

Empirical Risk Minimization (ERM)

9 of 16

Classification by Logistic Regression Using Cross-Entropy Loss

Binary Classification Prediction: Predict y∊{0,1} from features x.

Binary Classification: Estimate P(Y=1|X) = f(X) for unknown distribution over (X, Y).

Logistic Regression: Assume P(Y=1|X) = σ(XTβ) and estimate β; σ(t)=1/(1+exp(-t)).

To Find Parameters: Choose a loss (& regularization); minimize empirical risk.

Cross-Entropy Loss for Logistic Regression: -(Y log σ(XTβ) + (1-Y) log (1-σ(XTβ)))

Empirical Risk: For a training (i.e. learning) set of observations (x₁, y₁), …, (xₙ, yₙ):

R(β) = (1/n) Σᵢ -(yᵢ log σ(xᵢTβ) + (1-yᵢ) log(1-σ(xᵢTβ)))

ERM: choose the parameters β̂ that minimize R(β) (plus a regularization term, if any).
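
A short NumPy sketch of the average cross-entropy loss defined above (not the course's official demo code; the clipping constant eps is an assumption added to avoid log(0)):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def empirical_risk(beta, X, y, eps=1e-12):
    """Average cross-entropy loss of logistic regression parameters beta
    on a training set (X, y) with y in {0, 1}."""
    p = np.clip(sigmoid(X @ beta), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))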

10 of 16

Logistic Regression and Gradient Descent

(Demo)

11 of 16

Numerical Optimization for Logistic Regression

Logistic regression parameter estimation has no closed-form (analytical) solution.

For data sets of moderate size (thousands of parameters, millions of examples), solutions very close to the optimum can often be computed quickly with standard numerical optimizers, so plain gradient descent is not often used.

For large data sets, good solutions can often be found quickly with mini-batch gradient descent, and finding parameters that are close to the minimum can be as good for generalization as finding the exact minimizer.

Some simple tricks for improving gradient descent (sketched in code after the list):

  • Start with a high learning rate and “decay” it as iteration count increases.
  • Clip gradients so that updates are never too large.
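
A compact sketch of mini-batch gradient descent with both tricks (assumptions not in the slides: NumPy, inverse-time learning-rate decay, and norm-based gradient clipping):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def minibatch_gd(X, y, lr0=1.0, decay=0.01, clip=5.0,
                 batch_size=128, epochs=10, seed=0):
    """Mini-batch gradient descent on the average cross-entropy loss.

    The learning rate at step k is lr0 / (1 + decay * k), and any
    gradient whose norm exceeds `clip` is rescaled to that norm.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (sigmoid(Xb @ beta) - yb) / len(idx)
            norm = np.linalg.norm(grad)
            if norm > clip:                  # gradient clipping
                grad *= clip / norm
            lr = lr0 / (1 + decay * step)    # learning-rate decay
            beta -= lr * grad
            step += 1
    return beta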

12 of 16

Regularized Logistic Regression

(Demo)

13 of 16

Guidance for Applying Regularization

Idea: Trade off training set performance for performance on all other data.

  • Regularization is a method to reduce overfitting.
  • In a train/test split, the test set is a sample of “all other data.”
  • Classifiers are often applied in practice to examples for which the true class labels are unknown (and will never be known to the classification system).
  • It’s not possible to know if regularization is helping by measuring training set loss: as training set performance decreases with more regularization, test set performance may be increasing or decreasing.
  • Overfitting may occur even if training set loss is lower than test set loss.
  • The basic recipe for choosing regularization is the same as for feature/model selection: for different configurations, train a classifier and test it on held-out data. Keep the configuration that performs best on held-out data (see the sketch after this list).
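
One way that recipe might look in code (a sketch using scikit-learn; the candidate C values and the split proportion are assumptions, and C is the inverse regularization strength):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_regularization(X, y, candidate_Cs=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Train one classifier per regularization strength and keep the one
    that scores best on held-out data."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    best_C, best_score = None, -1.0
    for C in candidate_Cs:
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        score = model.score(X_val, y_val)  # held-out accuracy
        if score > best_score:
            best_C, best_score = C, score
    return best_C, best_score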

14 of 16

Overfitting to the Held-Out Set

Trying a bunch of different regularization coefficients is itself a form of numerical optimization.

Concerns:

  • Is classifier performance on the held-out set a good estimate for performance on other unseen held-out data?
  • When evaluating many different regularization configurations or feature sets on the same held-out test set, you may find a configuration that happens to be better for that sample than it would be on another sample.

A solution is to split labeled data further (sketched in code after the list):

  • Training (learning) set is used to minimize regularized empirical risk.
  • Validation set is used to select models, features, and regularization.
  • Test set is used to estimate performance of the final choice.
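
A minimal sketch of such a three-way split (using scikit-learn's train_test_split; the 60/20/20 proportions are an assumption):

from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=0):
    """Split labeled data into training, validation, and test sets (60/20/20)."""
    # Hold out 20% as the final test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # Carve a validation set out of the remainder (0.25 of 80% = 20% overall).
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)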

15 of 16

Multiclass Classification

16 of 16

Classification by Logistic Regression Using Cross-Entropy Loss

(Demo)

Multiclass Classification Prediction: Predict y∊{0,1,2,...,d} from features x.

Multiclass Classification: Estimate P(Y=y|X) = fy(X) for each possible y∊{0,1,2,...,d}.

Multiclass Logistic Regression: Assume one parameter vector βy per class and

P(Y=y|X) = exp(XTβy) / Σk exp(XTβk)

(the softmax of the scores XTβ0, …, XTβd).

Minimize Empirical Risk with Cross-Entropy Loss: the loss on an observation (x, y) is -log P(Y=y|X=x); ERM minimizes the average of this loss over the training set (plus a regularization term, if used).
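
A small NumPy sketch of softmax probabilities and the average multiclass cross-entropy loss under the formulation above (not from the slides; B is an assumed features-by-classes parameter matrix with one column βy per class):

import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(B, X, y, eps=1e-12):
    """Average -log P(Y=y|X) with P(Y|X) = softmax(X @ B).

    B: (num_features, num_classes) parameter matrix.
    y: integer class labels in {0, ..., num_classes - 1}.
    """
    probs = softmax(X @ B)
    return -np.mean(np.log(probs[np.arange(len(y)), y] + eps))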