CSE 519: Data Science
Steven Skiena
Stony Brook University
Lecture 17: Logistic Regression and Classification
Classification Problems
Often we are given collections of examples labeled by class:
Classification assigns a label to an input record.
Regression for Classification
We could use linear regression to build from annotated examples by converting the class names to numbers:
Zero/one works for binary classifiers. By convention, the “positive” class gets 1 and the “negative” one 0.
Class Labels from Regression Lines
The regression line will cut through these classes, even through a separator exists.
Adding very +/- examples shifts the line but hurts the boundary
Decision Boundaries
Ideally, our two classes will be well-separated in feature space, so a line can partition them.
Logistic regression is a method to find the best separating line for a given training set.
Non-Linear Decision Boundaries
Logistic regression can find non-linear boundaries if seeded with non-linear features.
To get separating circles, we need to explicitly add features like x^2 and x1 * x2.
The Logit Function
We want a way to take a real variable and convert it to a probability:
Logit can give the probability of x being in a particular class.
Logit for Classification
To extend logit to m variables, and set the threshold and steepness parameters, fit
Plug into logit to get a classifier:
Scoring for Logistic Regression
Assume each of our n training examples are labeled as to being in either class 0 or 1.
We seek to identify the coefficients w that give high probability on positive (class 1) items, and low probability on negative (class 0) items.
We need a cost function to value a probability p for an item of class c.
Costs for Positive / Negative Cases
We want zero error for the best probability, and increasing cost as the class prediction gets more and more wrong.
Logarithms of Probabilities
Observe that log(1) = 0. Thus it gives proper cost for correct predictions of positive items.
Observe that The cost function:
For negative instances, the cost function is:
Cost/Loss Function
We can use the zero/one labels as indicator variables to compute the loss function:
This works because the class indicator variables are 0 and 1.
Logistic Regression via Gradient Descent
The loss function here is convex, so we can find the parameters which best fit it by gradient descent.
Thus we can find the best linear separator between two classes (linear in the possibly non-linear features).
Logistic Gender Classification
Red region: 229 w / 63 m
Blue region:
223 m / 65 w
Issues in Logistic Classification
Balanced Training Classes
Consider the optimal separating line for grossly unbalanced class sizes, say 1 positive example vs. 1,000,000 negative examples.
The best scoring line from logistic regression will try to be very far from the big cluster instead of the midpoint between them.
Use equal numbers of pos and neg examples.
Ways to Balance Classes
SMOTE: Synthetic Minority Over-sampling Technique
Create synthetic data examples in the right places, between positive p and near neighbor n.
Pick a random point between p and n: p + c*(n-p), where random c is between 0 and 1.
Multi-class Classification
Classification tasks are not always binary.
Is a given movie a comedy, drama, action, documentary, etc.?
Encoding Multi-Classes: Bad Idea
It is natural to represent multiple classes by distinct numbers: blond=0, brown=1, red=2.
But unless the ordering of the classes reflect an increasing scale, the numbering is meaningless.
Classes on a Likert scale are ordinal.
4 stars - 3 stars - 2 stars - 1 star
One Versus All Classifiers
We can build multi-class classifiers by building multiple independent binary classifiers.
Select the class of highest probability as the predicted label.
The Challenges of Many Classes
Multi-class classification gets much harder as the number of classes increase.
The monkey is right only 1/k times for k classes.
Hierarchical Classification
Grouping classes by similarity and building a taxonomy reduces the effective number of classes.
Imagenet has 27 categories (e.g. animal, appliance, food, furniture) on top of 21,841 subcategories.
Classify from top-down in tree.
Bayesian Priors
Having honest estimates for the probability distribution of classes is essential to avoid domination by rare classes (“rock stars”)
Here A is the given class, and B is the event your classifier returned a guess of class A.
Multilabel Classification
Arises in document classification and images with multiple objects.
Metrics like precision/recall at k become more appropriate.
Conditional dependence between classes can give priors to better interpret label probabilities.
Partition Functions
There is no reason why independent binary classifiers yield probabilities which sum to 1.
To get real probabilities for Bayes theorem:
and P(c) = F_c(x) / T.
Alternately, softmax is
Softmax vs. Standard Normalization