Classification
Data 100
Slides by John DeNero
Announcements
Classifiers and Decisions
Classification Rules
A classifier is a function f(x) that outputs a prediction of y: 0 or 1.
Logistic regression finds a function that estimates P(Y=1|X).
Given a particular x to classify, the most common classification rule is: predict 1 if the estimated P(Y=1|X=x) = σ(xTβ) ≥ 0.5, and predict 0 otherwise.
Which is equivalent to: predict 1 if xTβ ≥ 0, since σ(t) ≥ 0.5 exactly when t ≥ 0.
The threshold used to distinguish among outputs can be adjusted if certain types of errors are more problematic than others.
Appropriate decision rules depend on the application context. When prediction accuracy is important in an uncertain domain, the best output might not be to label an example 0 or 1 but instead to skip it.
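As a concrete illustration, the sketch below applies a thresholded classification rule to the probabilities produced by a fitted model; the function name classify and the default threshold of 0.5 are illustrative, and the threshold can be raised or lowered when one type of error is more costly than the other.

```python
import numpy as np

def classify(model, X, threshold=0.5):
    """Predict 1 when the estimated P(Y=1|X) meets the threshold, else 0.

    model: a fitted classifier exposing predict_proba (e.g. sklearn's
           LogisticRegression); X: a 2-D array of features.
    """
    p_hat = model.predict_proba(X)[:, 1]  # estimated P(Y=1|X) for each row
    return (p_hat >= threshold).astype(int)
```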
Evaluating Classifiers
The loss used for model fitting is typically not the best evaluation metric.
Accuracy: (TP + TN) / n (the most obvious/common evaluation metric)
Error rate: (FP + FN) / n
Precision: TP / (TP + FP) (used when detecting rare outcomes)
Recall: TP / (TP + FN) (used when detecting rare outcomes)
|                | Truth: 1            | Truth: 0            |
|----------------|---------------------|---------------------|
| Prediction: 1  | True positive (TP)  | False positive (FP) |
| Prediction: 0  | False negative (FN) | True negative (TN)  |
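These metrics can be computed directly from predicted and true labels. A minimal numpy sketch (the array names y_true and y_pred are illustrative):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall from 0/1 label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```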
Linear Separability
For a set of observations (x1, y1), ..., (xn, yn), is it possible to classify them all perfectly using a linear function of x?
Does there exist β such that: xTβ < 0 for only and all x whose y is 0, and
xTβ ≥ 0 for only and all x whose y is 1?
If so, that data set is linearly separable and a classifier can have 100% training accuracy (and therefore 100% training precision and 100% training recall).
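One rough way to check whether a data set appears linearly separable is to fit logistic regression with very weak regularization and see whether training accuracy reaches 100%. This is only a sketch on a tiny synthetic example, not a definitive test, since the optimizer may stop before finding a separating β even when one exists.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example: two clusters that are clearly separable.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([0, 0, 1, 1])

# A very large C means very weak regularization, so the fit is close to
# unregularized logistic regression.
clf = LogisticRegression(C=1e6).fit(X, y)

# 100% training accuracy is consistent with linear separability.
print(clf.score(X, y) == 1.0)
```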
Concerns:
Empirical Risk Minimization (ERM)
Classification by Logistic Regression Using Cross-Entropy Loss
Binary Classification Prediction: Predict y∊{0,1} from features x.
Binary Classification: Estimate P(Y=1|X) = f(X) for unknown distribution over (X, Y).
Logistic Regression: Assume P(Y=1|X) = σ(XTβ) and estimate β; σ(t)=1/(1+exp(-t)).
To Find Parameters: Choose a loss (& regularization); minimize empirical risk.
Cross-Entropy Loss for Logistic Regression: -(Y log σ(XTβ) + (1-Y) log (1-σ(XTβ)))
Empirical Risk: For a training (i.e. learning) set of observations (x1, y1), ..., (xn, yn), the empirical risk is the average loss over those n observations (see the sketch below).
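Putting these definitions together, here is a minimal numpy sketch of the cross-entropy loss and the empirical risk for logistic regression; the names X, y, and beta are illustrative placeholders for the feature matrix, label array, and parameter vector.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def cross_entropy_loss(y, p):
    """-(y log p + (1 - y) log (1 - p)) for a single example or arrays."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def empirical_risk(beta, X, y):
    """Average cross-entropy loss over the training set."""
    p = sigmoid(X @ beta)  # estimated P(Y=1|X) for each row of X
    return np.mean(cross_entropy_loss(y, p))
```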
ERM
Logistic Regression and Gradient Descent
(Demo)
Numerical Optimization for Logistic Regression
Logistic regression parameter estimation has no closed-form (analytical) solution.
For data sets of moderate size (thousands of parameters, millions of examples), highly precise numerical solutions can often be computed quickly with standard optimizers, so gradient descent is not often used.
For large data sets, good solutions can often be found quickly with mini-batch gradient descent, and finding parameters that are close to the minimum can be as good for generalization as finding the exact minimizer.
Some simple tricks for improving gradient descent:
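The demo itself is not reproduced here; as a rough companion to the points above, this is a minimal sketch of mini-batch gradient descent for logistic regression with cross-entropy loss. The step size, batch size, and epoch count are illustrative choices, not recommendations.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def minibatch_gradient_descent(X, y, step_size=0.1, batch_size=32, epochs=100):
    """Fit logistic regression by mini-batch gradient descent on cross-entropy loss."""
    n, d = X.shape
    beta = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)  # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of the average cross-entropy loss on the batch.
            grad = Xb.T @ (sigmoid(Xb @ beta) - yb) / len(batch)
            beta -= step_size * grad
    return beta
```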
Regularized Logistic Regression
(Demo)
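The demo is not reproduced here; a minimal sketch of L2-regularized logistic regression in scikit-learn, where the parameter C is the inverse of the regularization strength and the value below is arbitrary, chosen only for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Smaller C means stronger L2 regularization, trading some training
# accuracy for better performance on unseen data.
regularized_model = LogisticRegression(penalty="l2", C=0.1)
```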
Guidance for Applying Regularization
Idea: Trade off training set performance for performance on all other data.
Overfitting to the Held-Out Set
Trying many different regularization coefficients and keeping the one that performs best on the held-out set is itself a form of numerical optimization, so the chosen model can overfit to the held-out set.
Concerns:
A solution is to split labeled data further: a training set for fitting parameters, a validation set for choosing hyperparameters such as the regularization coefficient, and a test set used only for the final evaluation.
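A minimal sketch of that workflow; the synthetic data, split fractions, and candidate C values are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; in practice X and y come from the application.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# Split into training, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Choose the regularization coefficient using the validation set only.
best_C, best_val_accuracy = None, -np.inf
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    val_accuracy = model.score(X_val, y_val)
    if val_accuracy > best_val_accuracy:
        best_C, best_val_accuracy = C, val_accuracy

# The test set is touched only once, for the final evaluation.
final_model = LogisticRegression(C=best_C).fit(X_train, y_train)
print(best_C, final_model.score(X_test, y_test))
```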
Multiclass Classification
Classification by Logistic Regression Using Cross-Entropy Loss
(Demo)
Multiclass Classification Prediction: Predict y∊{0,1,2,...,d} from features x.
Multiclass Classification: Estimate P(Y=y|X) for each possible y∊{0,1,2,...,d}.
Multiclass Logistic Regression: Assume P(Y=y|X) = exp(XTβy) / Σk exp(XTβk) (the softmax function), with one parameter vector βy per class, and estimate the βy.
Minimize Empirical Risk with Cross-Entropy Loss: the loss on an example (x, y) is -log P(Y=y|X=x), the negative log of the estimated probability of the true class, averaged over the training set.
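A minimal sketch of multiclass (softmax) logistic regression in scikit-learn, using a built-in example data set purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# load_iris has 3 classes; LogisticRegression fits a softmax (multinomial)
# model when there are more than two classes.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one estimated P(Y=y|X) per class for each row.
print(model.predict_proba(X[:3]).round(3))
print(model.predict(X[:3]))
```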