1 of 16

Lecture 36

Classifiers

DATA 8

Summer 2017

Slides created by John DeNero (denero@berkeley.edu), Ani Adhikari (adhikari@berkeley.edu), and Sam Lau (samlau95@berkeley.edu)

2 of 16

Announcements

3 of 16

Classifiers

4 of 16

Training a Classifier

Classifier

Attributes of an example

Predicted label of the example

Population

Labels

Sample

Training Set

Test Set

Model the association between attributes & labels

Estimate the accuracy of the classifier

Fraud: order fraud (Amazon, PayPal, etc.), account fraud (LinkedIn)

Dating: https://www.wired.com/2014/01/how-to-hack-okcupid/

If, through statistical sampling, McKinlay could ascertain which questions mattered to the kind of women he liked, he could construct a new profile that honestly answered those questions and ignored the rest. He could match every woman in LA who might be right for him, and none that weren’t.

Diagnosis: http://www.newyorker.com/magazine/2017/04/03/ai-versus-md

In June, 2015, Thrun’s team began to test what the machine had learned from the master set of images by presenting it with a “validation set”: some fourteen thousand images that had been diagnosed by dermatologists (although not necessarily by biopsy). Could the system correctly classify the images into three diagnostic categories—benign lesions, malignant lesions, and non-cancerous growths? The system got the answer right seventy-two per cent of the time. (The actual output of the algorithm is not “yes” or “no” but a probability that a given lesion belongs to a category of interest.) Two board-certified dermatologists who were tested alongside did worse: they got the answer correct sixty-six per cent of the time.

Personality: https://motherboard.vice.com/en_us/article/how-our-likes-helped-trump-win (Cambridge Analytica—Board member Steve Bannon)

Link personality tests to Facebook profiles

"Followers of Lady Gaga were most probably extroverts, while those who "liked" philosophy tended to be introverts. While each piece of such information is too weak to produce a reliable prediction, when tens, hundreds, or thousands of individual data points are combined, the resulting predictions become really accurate."

"Up to now, explains Nix, election campaigns have been organized based on demographic concepts. "A really ridiculous idea. The idea that all women should receive the same message because of their gender—or all African Americans because of their race." What Nix meant is that while other campaigners so far have relied on demographics, Cambridge Analytica was using psychometrics."

5 of 16

Nearest Neighbor Classifier

NN Classifier

Use the label of the most similar training example

Attributes of an example

Predicted label of the example

Population

Sample

Labels

Training Set

Test Set
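A rough sketch of "use the label of the most similar training example", assuming that similarity means small straight-line (Euclidean) distance between attribute values. The function name `nearest_neighbor_label`, the labels, and the data are hypothetical.

```python
import numpy as np

def nearest_neighbor_label(train_x, train_y, example):
    """Return the label of the training example closest to `example`."""
    # Distance from the new example to every training row.
    distances = np.sqrt(np.sum((train_x - example) ** 2, axis=1))
    # The most similar training example is the one at the smallest distance.
    return train_y[np.argmin(distances)]

# Hypothetical training set: 2 attributes per example, labels 'blue' or 'gold'.
train_x = np.array([[1.0, 1.0], [2.0, 1.5], [6.0, 7.0], [7.0, 6.5]])
train_y = np.array(['blue', 'blue', 'gold', 'gold'])

prediction = nearest_neighbor_label(train_x, train_y, np.array([6.5, 6.0]))
```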

6 of 16

The Google Science Fair

Brittany Wenger, a 17-year-old high school student, won in 2012
She built a breast cancer classifier with 99% accuracy

(Demo)

7 of 16

Distance

8 of 16

Rows of Tables

Each row contains all the data for one individual

t.row(i) evaluates to the ith row of table t
t.row(i).item(j) is the value of column j in row i
If all values are numbers, then np.array(t.row(i)) evaluates to an array of all the numbers in the row
To consider each row individually, use for row in t.rows: ... row.item(j) ...

9 of 16

Distance Between Two Points

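Two attributes x and y: $D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}$

Three attributes x, y, and z: $D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$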

and so on ...
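A small sketch of the distance computation, assuming the straight-line (Euclidean) distance: square the difference in each attribute, add the squares, and take the square root. The function name `distance` is illustrative; the same code handles two, three, or more attributes.

```python
import numpy as np

def distance(point1, point2):
    """Euclidean distance between two points with any number of attributes."""
    point1 = np.asarray(point1, dtype=float)
    point2 = np.asarray(point2, dtype=float)
    # Square each attribute difference, sum the squares, take the square root.
    return np.sqrt(np.sum((point1 - point2) ** 2))

d2 = distance([0, 0], [3, 4])          # two attributes: sqrt(9 + 16) = 5
d3 = distance([1, 2, 3], [3, 4, 4])    # three attributes: sqrt(4 + 4 + 1) = 3
```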

(Demo)

10 of 16

Attendance

bit.ly/at-d

11 of 16

Nearest Neighbors

12 of 16

Finding the k Nearest Neighbors

To find the k nearest neighbors of an example:

Find the distance between the example and each example in the training set
Augment the training data table with a column containing all the distances
Sort the augmented table in increasing order of the distances
Take the top k rows of the sorted table
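The four steps above can be sketched with NumPy arrays standing in for the training table. This is an illustration under assumptions: the function name `k_nearest` and the data are hypothetical, and distance is taken to be Euclidean.

```python
import numpy as np

def k_nearest(train_x, train_y, example, k):
    """Return attributes, labels, and distances of the k closest training rows."""
    # Step 1: distance between the example and each training example.
    distances = np.sqrt(np.sum((train_x - example) ** 2, axis=1))
    # Steps 2-3: associate each row with its distance and sort by increasing distance.
    order = np.argsort(distances, kind='stable')
    # Step 4: take the first k rows of the sorted data.
    top = order[:k]
    return train_x[top], train_y[top], distances[top]

# Hypothetical training set with two attributes and a 0/1 label.
train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1])

neighbor_x, neighbor_labels, neighbor_distances = k_nearest(
    train_x, train_y, np.array([0.4, 0.0]), k=3)
```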

(Demo)

13 of 16

The Classifier

To classify a point:

Find its k nearest neighbors

Take a majority vote of the k nearest neighbors to see which of the two classes appears more often

Assign the point the class that wins the majority vote
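A minimal sketch of the classifier, assuming Euclidean distance and two classes. The function name `classify` and the data are hypothetical. With two classes, an odd k guarantees that the vote cannot tie.

```python
import numpy as np
from collections import Counter

def classify(train_x, train_y, example, k):
    """Predict a label by majority vote among the k nearest training examples."""
    distances = np.sqrt(np.sum((train_x - example) ** 2, axis=1))
    nearest = np.argsort(distances, kind='stable')[:k]   # indices of the k nearest
    votes = Counter(train_y[nearest].tolist())           # count labels among them
    return votes.most_common(1)[0][0]                    # label with the most votes

# Hypothetical training set: two clusters labeled 0 and 1.
train_x = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                    [6.0, 6.0], [7.0, 6.0], [6.0, 7.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

label_a = classify(train_x, train_y, np.array([2.0, 2.0]), k=3)
label_b = classify(train_x, train_y, np.array([5.0, 5.0]), k=3)
```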

(Demo)

14 of 16

Evaluation

15 of 16

Accuracy of a Classifier

The accuracy of a classifier on a labeled data set is the proportion of examples that are labeled correctly

Need to compare classifier predictions to true labels

If the labeled data set is sampled at random from a population, then we can infer accuracy on that population

Sample

Labels

Training Set

Test Set
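Accuracy is a simple proportion, as this sketch shows. The function name `accuracy` and the label arrays are hypothetical; in practice the predictions would come from running the classifier on each row of the test set.

```python
import numpy as np

def accuracy(predicted, actual):
    """Proportion of examples whose predicted label matches the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return np.count_nonzero(predicted == actual) / len(actual)

# Hypothetical test-set results: the third prediction is wrong.
predicted_labels = np.array([1, 0, 1, 1, 0])
true_labels = np.array([1, 0, 0, 1, 0])

test_accuracy = accuracy(predicted_labels, true_labels)  # 4 of 5 correct
```

Computing this on the test set rather than the training set matters: a nearest neighbor classifier has already seen the training examples, so accuracy measured on them would overstate how well it does on new data.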

(Demo)

16 of 16

Decision Boundaries

A change in input attributes might change the prediction
Inputs that are very close but result in different predicted labels are on either side of a decision boundary
To visualize, plot predictions of a regular set of inputs
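One way to produce the "regular set of inputs" is to classify every point on an evenly spaced grid. The sketch below assumes a k-nearest-neighbor classifier with Euclidean distance and integer labels 0 and 1. The names `classify`, `ticks`, and `grid_predictions`, and the data, are hypothetical.

```python
import numpy as np

def classify(train_x, train_y, example, k):
    """k-nearest-neighbor majority vote for integer class labels 0 and 1."""
    distances = np.sqrt(np.sum((train_x - example) ** 2, axis=1))
    nearest = np.argsort(distances, kind='stable')[:k]
    return int(np.argmax(np.bincount(train_y[nearest], minlength=2)))

# Hypothetical training set: two clusters labeled 0 and 1.
train_x = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                    [6.0, 6.0], [7.0, 6.0], [6.0, 7.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

# Regular grid of inputs: 17 evenly spaced values per attribute, from 0 to 8.
ticks = np.linspace(0, 8, 17)
grid = np.array([[x, y] for x in ticks for y in ticks])

# Predicted label at every grid point.
grid_predictions = np.array([classify(train_x, train_y, p, k=3) for p in grid])
```

Drawing a scatter plot of `grid` colored by `grid_predictions` would show two regions. The line where the color changes is the decision boundary.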

(Demo)