1 of 16

Lecture 36

Classifiers

DATA 8

Summer 2017

Slides created by John DeNero (denero@berkeley.edu), Ani Adhikari (adhikari@berkeley.edu), and Sam Lau (samlau95@berkeley.edu)

2 of 16

Announcements

3 of 16

Classifiers

4 of 16

Training a Classifier

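[Diagram: a sample with labels is drawn from the population and split into a training set and a test set. The classifier takes the attributes of an example and outputs the predicted label of the example.]

  • Training set: model the association between attributes & labels
  • Test set: estimate the accuracy of the classifier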
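A minimal sketch of splitting a labeled sample into a training set and a test set. It assumes the attributes and labels are held in NumPy arrays rather than a table; the function name `train_test_split`, the `train_fraction` parameter, and the fixed seed are illustrative choices, not part of the lecture.

```python
import numpy as np

def train_test_split(attrs, labels, train_fraction=0.5, seed=0):
    """Randomly split a labeled sample into training and test portions.

    attrs: 2-D array, one row of attributes per example (assumed layout).
    labels: 1-D array with the label of each example.
    """
    attrs = np.asarray(attrs)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)        # seeded so the split is repeatable
    order = rng.permutation(len(labels))     # shuffle the row indices
    n_train = int(round(train_fraction * len(labels)))
    train, test = order[:n_train], order[n_train:]
    return attrs[train], labels[train], attrs[test], labels[test]

# Hypothetical data: 10 examples with 2 attributes each
sample_attrs = np.arange(20).reshape(10, 2)
sample_labels = np.arange(10)
train_a, train_l, test_a, test_l = train_test_split(sample_attrs, sample_labels, 0.7)
```

Shuffling before splitting means every example lands in exactly one of the two sets, so the test set stays unseen while the classifier is being built.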

5 of 16

Nearest Neighbor Classifier

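NN Classifier: Use the label of the most similar training example

[Diagram: a sample with labels is drawn from the population and split into a training set and a test set. The NN classifier takes the attributes of an example and outputs the predicted label of the example.]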

6 of 16

The Google Science Fair

  • Brittany Wenger, a 17-year-old high school student in 2012
  • Won by building a breast cancer classifier with 99% accuracy

(Demo)

7 of 16

Distance

8 of 16

Rows of Tables

Each row contains all the data for one individual

  • t.row(i) evaluates to the ith row of table t
  • t.row(i).item(j) is the value of column j in row i
  • If all values are numbers, then np.array(t.row(i)) evaluates to an array of all the numbers in the row.
  • To consider each row individually, use for row in t.rows: ... row.item(j) ...

9 of 16

Distance Between Two Points

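  • Two attributes x and y: $D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}$
  • Three attributes x, y, and z: $D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$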
  • and so on ...
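A short sketch of the distance calculation, assuming the straight-line (Euclidean) distance and that each point is given as an array of numeric attributes. The function name `distance` is an illustrative choice; because it works on whole arrays, the same code covers two attributes, three attributes, and so on.

```python
import numpy as np

def distance(point1, point2):
    """Euclidean distance between two points with the same number of attributes."""
    point1 = np.asarray(point1, dtype=float)
    point2 = np.asarray(point2, dtype=float)
    # Square the difference in each attribute, add them up, take the square root
    return np.sqrt(np.sum((point1 - point2) ** 2))

two_attributes = distance([0, 0], [3, 4])          # a 3-4-5 right triangle
three_attributes = distance([1, 2, 3], [4, 6, 3])  # z values match, so again 3-4-5
```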

(Demo)

10 of 16

Attendance

11 of 16

Nearest Neighbors

12 of 16

Finding the k Nearest Neighbors

To find the k nearest neighbors of an example:

  • Find the distance between the example and each example in the training set
  • Augment the training data table with a column containing all the distances
  • Sort the augmented table in increasing order of the distances
  • Take the top k rows of the sorted table
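The four steps above can be sketched with NumPy arrays standing in for the training table. This is an assumption made to keep the example self-contained; the function name `k_nearest` and the toy data are hypothetical.

```python
import numpy as np

def k_nearest(train_attrs, train_labels, example, k):
    """Return (label, distance) pairs for the k training rows closest to example."""
    train_attrs = np.asarray(train_attrs, dtype=float)
    example = np.asarray(example, dtype=float)
    # Step 1: distance between the example and each training example
    distances = np.sqrt(np.sum((train_attrs - example) ** 2, axis=1))
    # Steps 2-3: pair each row with its distance and sort in increasing order
    order = np.argsort(distances, kind="stable")
    # Step 4: keep the top k rows
    return [(train_labels[i], distances[i]) for i in order[:k]]

# Hypothetical training data: two attributes per example, labels 0 and 1
train_attrs = [[0, 0], [1, 1], [5, 5], [6, 6]]
train_labels = [0, 0, 1, 1]
neighbors = k_nearest(train_attrs, train_labels, [0.9, 0.9], 2)
```

Here `np.argsort` plays the role of sorting the augmented table: it returns the row indices ordered by distance, so the first k indices identify the nearest neighbors.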

(Demo)

13 of 16

The Classifier

To classify a point:

  • Find its k nearest neighbors

  • Take a majority vote of the k nearest neighbors to see which of the two classes appears more often

  • Assign the point the class that wins the majority vote
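A minimal sketch of the majority-vote classifier, again assuming NumPy arrays in place of the training table and Euclidean distance between rows. The function name `classify` and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def classify(train_attrs, train_labels, example, k):
    """Predict a label for example by majority vote among its k nearest neighbors."""
    train_attrs = np.asarray(train_attrs, dtype=float)
    example = np.asarray(example, dtype=float)
    # Find the k nearest neighbors
    distances = np.sqrt(np.sum((train_attrs - example) ** 2, axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count how often each class appears among the neighbors
    votes = Counter(train_labels[i] for i in nearest)
    # Assign the class that wins the vote
    return votes.most_common(1)[0][0]

# Hypothetical training data with two classes
train_attrs = [[0, 0], [1, 1], [5, 5], [6, 6]]
train_labels = [0, 0, 1, 1]
prediction = classify(train_attrs, train_labels, [5.2, 5.1], 3)
```

With two classes, choosing an odd k avoids tied votes.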

(Demo)

14 of 16

Evaluation

15 of 16

Accuracy of a Classifier

The accuracy of a classifier on a labeled data set is the proportion of examples that are labeled correctly

Need to compare classifier predictions to true labels

If the labeled data set is sampled at random from a population, then we can infer accuracy on that population
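As a small illustration, accuracy can be computed by comparing an array of predictions with the array of true labels. The function name `accuracy` and the example labels are hypothetical.

```python
import numpy as np

def accuracy(predictions, true_labels):
    """Proportion of examples whose predicted label matches the true label."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    # Count the matches and divide by the number of examples
    return np.count_nonzero(predictions == true_labels) / len(true_labels)

# Hypothetical test-set results: three of the four predictions are correct
test_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Computing this on the test set, rather than on the training set, is what allows the estimate to carry over to the population.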

[Diagram: a sample with labels, split into a training set and a test set]

(Demo)

16 of 16

Decision Boundaries

  • A change in input attributes might change the prediction
  • Inputs that are very close but result in different predicted labels are on either side of a decision boundary
  • To visualize, plot predictions of a regular set of inputs
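One way to prepare such a visualization is to classify every point on a regular grid and then look at where the predicted label changes. The sketch below only computes the grid of predictions and leaves plotting out; it assumes a k-nearest-neighbor classifier with Euclidean distance, and the function name `predict_grid` and the toy data are hypothetical.

```python
import numpy as np

def predict_grid(train_attrs, train_labels, xs, ys, k=1):
    """Predict a label at every (x, y) on a regular grid.

    Returns a 2-D array with one row per y value and one column per x value.
    """
    train_attrs = np.asarray(train_attrs, dtype=float)
    train_labels = np.asarray(train_labels)
    grid = np.empty((len(ys), len(xs)), dtype=train_labels.dtype)
    for r, y in enumerate(ys):
        for c, x in enumerate(xs):
            d = np.sqrt(np.sum((train_attrs - np.array([x, y])) ** 2, axis=1))
            nearest = np.argsort(d, kind="stable")[:k]
            # Majority vote among the k nearest training labels
            values, counts = np.unique(train_labels[nearest], return_counts=True)
            grid[r, c] = values[np.argmax(counts)]
    return grid

# Hypothetical training data with two classes
train_attrs = [[0, 0], [1, 1], [5, 5], [6, 6]]
train_labels = [0, 0, 1, 1]
ticks = np.arange(0, 7)
grid = predict_grid(train_attrs, train_labels, ticks, ticks)
```

Neighboring grid cells with different values sit on either side of the decision boundary; coloring the cells by predicted label makes the boundary visible.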

(Demo)