1 of 27

CSE 519: Data Science

Steven Skiena

Stony Brook University

Lecture 17: Logistic Regression and Classification

2 of 27

Classification Problems

Often we are given collections of examples labeled by class:

male / female?
democrat / republican?
spam / non-spam (ham)?
cancer / benign?

Classification assigns a label to an input record.

3 of 27

Regression for Classification

We could use linear regression to build a classifier from annotated examples by converting the class names to numbers:

male=0 / female=1
democrat=0 / republican=1
spam=1 / non-spam=0
cancer=1 / benign=0

Zero/one works for binary classifiers. By convention, the “positive” class gets 1 and the “negative” one 0.

4 of 27

Class Labels from Regression Lines

The regression line will cut through these classes, even though a separator exists.

Adding very positive or very negative examples (points far from the boundary) shifts the line but hurts the boundary.
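The effect of an extreme example can be checked with a small sketch. It assumes a single feature, ordinary least squares on the 0/1 labels, and a class threshold where the fitted line crosses 0.5; the data and function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def boundary(xs, ys):
    """The x where the fitted line crosses 0.5, used as the class threshold."""
    a, b = fit_line(xs, ys)
    return (0.5 - a) / b

xs, ys = [1, 2, 3, 4], [0, 0, 1, 1]      # toy data, separable at x = 2.5
before = boundary(xs, ys)                # approximately 2.5
after = boundary(xs + [20], ys + [1])    # approximately 3.22
print(before, after)
```

The added point at x = 20 is correctly labeled and far from the boundary, yet it drags the threshold past x = 3, so a positive training example is now misclassified.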

5 of 27

Decision Boundaries

Ideally, our two classes will be well-separated in feature space, so a line can partition them.

Logistic regression is a method to find the best separating line for a given training set.

6 of 27

Non-Linear Decision Boundaries

Logistic regression can find non-linear boundaries if seeded with non-linear features.

To get separating circles, we need to explicitly add features like x1^2 and x1 * x2.
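A minimal sketch of why added features help, assuming two input features and a hand-picked circle of radius 2; the weights are chosen by hand here rather than learned, and all names are illustrative.

```python
def quadratic_features(x1, x2):
    """Expand (x1, x2) with the degree-2 terms a circular boundary needs."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]

# The circle x1^2 + x2^2 = 4 is a linear boundary in the expanded features:
# intercept -4 and weights [0, 0, 1, 1, 0].
w0 = -4.0
w = [0.0, 0.0, 1.0, 1.0, 0.0]

def inside_circle(x1, x2):
    """True when the linear score on the expanded features is negative."""
    feats = quadratic_features(x1, x2)
    score = w0 + sum(wi * fi for wi, fi in zip(w, feats))
    return score < 0

print(inside_circle(1.0, 1.0))   # True: score is 1 + 1 - 4 = -2
print(inside_circle(3.0, 0.0))   # False: score is 9 - 4 = 5
```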

7 of 27

The Logit Function

We want a way to take a real variable and convert it to a probability:

f(0) = ½
f(infinity) = 1
f(-infinity) = 0

Logit can give the probability of x being in a particular class.
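A function with these three properties can be sketched as follows, assuming the standard logistic (sigmoid) form f(x) = 1 / (1 + e^(-c*x)). The steepness parameter c and the function name are illustrative and not taken from the slides.

```python
import math

def logistic(x, c=1.0):
    """Map a real x into the range 0 to 1; c (assumed name) sets the steepness."""
    z = c * x
    # Branch on the sign so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(logistic(0.0))     # 0.5
print(logistic(50.0))    # very close to 1
print(logistic(-50.0))   # very close to 0
```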

8 of 27

Logit for Classification

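To extend logit to m variables, and set the threshold and steepness parameters, fit a linear function of the inputs:

h(x, w) = w_0 + w_1 * x_1 + ... + w_m * x_m

Plug into logit to get a classifier:

f(x) = 1 / (1 + e^(-h(x, w)))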
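As a rough sketch, the classifier is a linear score passed through the logistic function and compared to a threshold. The coefficient values, the 0.5 threshold, and the function names below are assumptions for illustration.

```python
import math

def h(x, w):
    """Linear score: w[0] is the intercept, w[1:] pair up with the features."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def predict_proba(x, w):
    """Probability of class 1 (fine for moderate scores; no overflow guard)."""
    return 1.0 / (1.0 + math.exp(-h(x, w)))

def classify(x, w, threshold=0.5):
    """Label 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, w) >= threshold else 0

w = [-3.0, 1.0, 2.0]               # hypothetical fitted coefficients
print(classify([2.0, 1.0], w))     # score is 1.0, so class 1
print(classify([1.0, 0.5], w))     # score is -1.0, so class 0
```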

9 of 27

Scoring for Logistic Regression

Assume each of our n training examples is labeled as being in either class 0 or class 1.

We seek to identify the coefficients w that give high probability on positive (class 1) items, and low probability on negative (class 0) items.

We need a cost function to value a probability p for an item of class c.

10 of 27

Costs for Positive / Negative Cases

We want zero error for the best probability, and increasing cost as the class prediction gets more and more wrong.

11 of 27

Logarithms of Probabilities

Observe that log(1) = 0. Thus the logarithm gives the proper cost for correct predictions of positive items.

The cost for a positive item predicted with probability p is:

cost = -log(p)

For negative instances, the cost function is:

cost = -log(1 - p)

12 of 27

Cost/Loss Function

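We can use the zero/one labels as indicator variables to compute the loss function:

J(w) = -(1/n) * sum over i of [ c_i * log(p_i) + (1 - c_i) * log(1 - p_i) ]

where p_i is the predicted probability and c_i the class label of example i.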

This works because the class indicator variables are 0 and 1.
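A minimal sketch of this loss, assuming the standard form in which a class 1 item costs -log(p) and a class 0 item costs -log(1 - p), averaged over the examples. The clipping constant eps is an added assumption to keep log(0) from being evaluated.

```python
import math

def log_loss(probs, labels, eps=1e-12):
    """Average cost: -log(p) for class 1 items, -log(1 - p) for class 0 items."""
    total = 0.0
    for p, c in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        # Because c is 0 or 1, exactly one of the two terms is active.
        total -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return total / len(labels)

good = log_loss([0.9, 0.1], [1, 0])   # confident and right: small loss
bad = log_loss([0.1, 0.9], [1, 0])    # confident and wrong: large loss
print(good, bad)
```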

13 of 27

Logistic Regression via Gradient Descent

The loss function here is convex, so we can find the parameters which best fit it by gradient descent.

Thus we can find the best linear separator between two classes (linear in the possibly non-linear features).
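The descent itself can be sketched in a few lines. This assumes batch gradient descent on the average log loss with a fixed learning rate; the rate, step count, toy data, and function names are illustrative choices, not values from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, rate=0.1, steps=2000):
    """Batch gradient descent on the average log loss; returns [w0, w1, ...]."""
    n, m = len(X), len(X[0])
    w = [0.0] * (m + 1)
    for _ in range(steps):
        grad = [0.0] * (m + 1)
        for xi, ci in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - ci              # derivative of the loss w.r.t. the score
            grad[0] += err
            for j in range(m):
                grad[j + 1] += err * xi[j]
        w = [wj - rate * gj / n for wj, gj in zip(w, grad)]
    return w

# Toy one-feature data with overlapping classes, so the weights stay finite.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
w = fit_logistic(X, y)
print(w)
```

On linearly separable data the weights keep growing without a regularization term, which is why the toy labels here overlap.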

14 of 27

Logistic Gender Classification

Red region: 229 w / 63 m

Blue region: 223 m / 65 w

15 of 27

Issues in Logistic Classification

Balanced training class sizes
Multi-class classification
Hierarchical classification
Partition functions

16 of 27

Balanced Training Classes

Consider the optimal separating line for grossly unbalanced class sizes, say 1 positive example vs. 1,000,000 negative examples.

The best scoring line from logistic regression will try to be very far from the big cluster instead of at the midpoint between the two classes.

Use equal numbers of positive and negative examples.

17 of 27

Ways to Balance Classes

Work harder to find members of the minority class.
Discard elements from the bigger class.
Weight the minority class more heavily, but beware of overfitting.
Replicate members of the smaller class, ideally with random perturbation.
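Two of these options, discarding from the bigger class and replicating the smaller class with perturbation, can be sketched as below. The Gaussian noise scale and the function names are assumptions for illustration.

```python
import random

def undersample(major, minor, rng):
    """Discard random elements of the bigger class to match the smaller one."""
    return rng.sample(major, len(minor)), list(minor)

def oversample(major, minor, rng, noise=0.05):
    """Replicate minority points, each copy with small Gaussian perturbation."""
    extra = []
    while len(minor) + len(extra) < len(major):
        base = rng.choice(minor)
        extra.append([v + rng.gauss(0.0, noise) for v in base])
    return list(major), list(minor) + extra

rng = random.Random(0)
major = [[float(i), float(i % 7)] for i in range(100)]   # toy majority class
minor = [[50.0, 3.0], [51.0, 4.0], [49.0, 2.0]]          # toy minority class

small_major, small_minor = undersample(major, minor, rng)
big_major, big_minor = oversample(major, minor, rng)
print(len(small_major), len(small_minor), len(big_major), len(big_minor))
```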

18 of 27

SMOTE: Synthetic Minority Over-sampling Technique

Create synthetic data examples in the right places, between a positive example p and a near neighbor n.

Pick a random point between p and n: p + c*(n-p), where random c is between 0 and 1.
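The interpolation step can be sketched as follows. It assumes the near neighbor is the closest other minority example by Euclidean distance; the helper names and toy points are hypothetical, and a full SMOTE implementation would choose among several nearest neighbors.

```python
import math
import random

def nearest_neighbor(p, others):
    """Closest point to p (Euclidean) among the other minority examples."""
    return min((q for q in others if q is not p), key=lambda q: math.dist(p, q))

def smote_point(p, n, rng):
    """Random point on the segment from p to n: p + c*(n - p), 0 <= c < 1."""
    c = rng.random()
    return [pi + c * (ni - pi) for pi, ni in zip(p, n)]

rng = random.Random(0)
minority = [[1.0, 1.0], [2.0, 1.5], [8.0, 9.0]]   # toy positive examples
p = minority[0]
n = nearest_neighbor(p, minority)
synthetic = smote_point(p, n, rng)
print(n, synthetic)
```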

19 of 27

Multi-class Classification

Classification tasks are not always binary.

Is a given movie a comedy, drama, action, documentary, etc.?

20 of 27

Encoding Multi-Classes: Bad Idea

It is natural to represent multiple classes by distinct numbers: blond=0, brown=1, red=2.

But unless the ordering of the classes reflects an increasing scale, the numbering is meaningless.

Classes on a Likert scale are ordinal.

4 stars - 3 stars - 2 stars - 1 star

Completely Agree
Mostly Agree
Slightly Agree
Slightly Disagree
Mostly Disagree
Completely Disagree

21 of 27

One Versus All Classifiers

We can build multi-class classifiers by building multiple independent binary classifiers.

Select the class of highest probability as the predicted label.
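A minimal sketch of the selection step, assuming one logistic model per class, each stored as a weight list with the intercept first. The class names echo the movie example, and the weights are made up.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def one_vs_all_predict(x, models):
    """Score x with every binary model and return the most probable label."""
    probs = {}
    for label, w in models.items():
        score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        probs[label] = sigmoid(score)
    return max(probs, key=probs.get), probs

# Hypothetical per-class weights [w0, w1, w2]; each model is class vs. the rest.
models = {
    "comedy": [0.0, 2.0, -1.0],
    "drama": [0.0, -1.0, 2.0],
    "action": [-1.0, 0.5, 0.5],
}
label, probs = one_vs_all_predict([1.0, 0.0], models)
print(label, probs)
```

The three probabilities here add up to more than 1, since each binary model is fit independently of the others.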

22 of 27

The Challenges of Many Classes

Multi-class classification gets much harder as the number of classes increases.

A monkey guessing at random is right only 1/k of the time for k classes.

With many available classes, certain mistakes are more acceptable than others.
The actual population sizes of rare classes can easily become vanishingly small.

23 of 27

Hierarchical Classification

Grouping classes by similarity and building a taxonomy reduces the effective number of classes.

ImageNet has 27 categories (e.g. animal, appliance, food, furniture) on top of 21,841 subcategories.

Classify from top-down in tree.

24 of 27

Bayesian Priors

Having honest estimates for the probability distribution of classes is essential to avoid domination by rare classes (“rock stars”).

By Bayes' theorem:

P(A | B) = P(B | A) * P(A) / P(B)

Here A is the given class, and B is the event your classifier returned a guess of class A.
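A worked example shows how much the prior matters, using Bayes' theorem in the form P(A|B) = P(B|A) * P(A) / P(B). The prior, hit rate, and false alarm rate below are invented numbers for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' theorem, expanding P(B) over A and not-A."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical rare class: 1 item in 1000 belongs to class A. The classifier
# guesses A for 90% of true A items and for 5% of everything else.
rare = posterior(0.001, 0.90, 0.05)
common = posterior(0.5, 0.90, 0.05)
print(rare, common)
```

With these numbers, a guess of the rare class is correct less than 2% of the time, while the same classifier on a balanced class is correct about 95% of the time.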

25 of 27

Multilabel Classification

Arises in document classification and images with multiple objects.

Metrics like precision/recall at k become more appropriate.

Conditional dependence between classes can give priors to better interpret label probabilities.

26 of 27

Partition Functions

There is no reason why independent binary classifiers should yield probabilities that sum to 1.

To get real probabilities for Bayes' theorem, sum the outputs F_i(x) of the k binary classifiers:

T = F_1(x) + F_2(x) + ... + F_k(x)

and P(c) = F_c(x) / T.

Alternatively, softmax is

P(c) = e^(F_c(x)) / (e^(F_1(x)) + ... + e^(F_k(x)))

27 of 27

Softmax vs. Standard Normalization
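The two normalizations can be compared on the same scores with a short sketch. It assumes standard normalization divides each classifier output by the total, and softmax divides the exponential of each output by the sum of exponentials; the scores are made up.

```python
import math

def normalize(scores):
    """Standard normalization: each score divided by the total T."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    """Exponentiate then normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.9, 0.3, 0.6]      # hypothetical outputs of three binary classifiers
plain = normalize(scores)
soft = softmax(scores)
print(plain)
print(soft)
```

Both results sum to 1 and keep the same ranking of classes. For scores confined to the range 0 to 1, as here, softmax gives a flatter distribution than plain normalization.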