1 of 19

Lecture 36

Classifiers

DATA 8

Spring 2022

2 of 19

Announcements

  • Homework 11 due tomorrow (04/21)
    • Turn in tonight for a bonus point
  • Project 3 Checkpoint due Friday (04/22)
    • Entire project due Friday (04/29)

3 of 19

Classifiers

4 of 19

Nearest Neighbor Classifier

NN Classifier

Use the label of the most similar training example

[Diagram: the attributes of an example go into the classifier, which outputs the predicted label of the example; the population yields a sample, which is labeled and split into a training set and a test set]

5 of 19

The Google Science Fair

  • Brittany Wenger, a 17-year-old high school student in 2012
  • Won by building a breast cancer classifier with 99% accuracy

(Demo)

6 of 19

Distance

7 of 19

Pythagoras’ Formula

The distance between points (x₀, y₀) and (x₁, y₁) is the length of the hypotenuse of a right triangle whose legs have lengths x₀ − x₁ and y₀ − y₁:

D = √((x₀ − x₁)² + (y₀ − y₁)²)

8 of 19

Distance Between Two Points

  • Two attributes x and y:  D = √((x₀ − x₁)² + (y₀ − y₁)²)
  • Three attributes x, y, and z:  D = √((x₀ − x₁)² + (y₀ − y₁)² + (z₀ − z₁)²)
  • and so on ...
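The formulas above generalize to any number of attributes. A minimal sketch of this distance function using numpy (the same idea as the one written in the demo):

```python
import numpy as np

def distance(pt1, pt2):
    """Euclidean distance between two points given as arrays of attributes."""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))

# Two attributes: a 3-4-5 right triangle
distance(np.array([0, 0]), np.array([3, 4]))   # 5.0

# Three attributes: works the same way
distance(np.array([1, 2, 3]), np.array([4, 6, 3]))
```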

(Demo)

9 of 19

Rows

10 of 19

Rows of Tables

Each row contains all the data for one individual

  • t.row(i) evaluates to the ith row of table t
  • t.row(i).item(j) is the value of column j in row i
  • If all values are numbers, then np.array(t.row(i)) evaluates to an array of all the numbers in the row
  • To consider each row individually, use:
    for row in t.rows:
        ... row.item(j) ...
  • t.exclude(i) evaluates to the table t without its ith row
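The same row operations can be mimicked with a plain 2-D numpy array; this hypothetical sketch shows the array analogue of each Table method above (the actual demo uses a datascience Table):

```python
import numpy as np

# A hypothetical 3-row "table" of two numeric attributes.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

row = data[1]                         # analogous to t.row(1)
value = data[1][0]                    # analogous to t.row(1).item(0)
without = np.delete(data, 1, axis=0)  # analogous to t.exclude(1)

# Considering each row individually, like `for row in t.rows`:
for r in data:
    r[0]  # analogous to row.item(0)
```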

11 of 19

Nearest Neighbors

12 of 19

Finding the k Nearest Neighbors

To find the k nearest neighbors of an example:

  • Find the distance between the example and each example in the training set
  • Augment the training data table with a column containing all the distances
  • Sort the augmented table in increasing order of the distances
  • Take the top k rows of the sorted table

13 of 19

The Classifier

To classify a point:

  • Find its k nearest neighbors

  • Take a majority vote of the k nearest neighbors to see which of the two classes appears more often

  • Assign the point to the class that wins the majority vote
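Putting the three steps together, a minimal sketch of the classifier (hypothetical array-based version; the demo builds it from a Table):

```python
import numpy as np

def classify(training, labels, example, k):
    """Predict a label for `example` by majority vote among its k nearest neighbors."""
    # Find the k nearest neighbors of the example
    dists = np.sqrt(np.sum((training - example) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # Majority vote: the class appearing most often among the k neighbors wins
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

training = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
labels = ['benign', 'benign', 'malignant']
classify(training, labels, np.array([0.0, 0.5]), 3)   # 'benign' by a 2-1 vote
```

With two classes, choosing an odd k avoids ties in the vote.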

(Demo)

14 of 19

Evaluation

15 of 19

Accuracy of a Classifier

The accuracy of a classifier on a labeled data set is the proportion of examples that are labeled correctly

Need to compare classifier predictions to true labels

If the labeled data set is sampled at random from a population, then we can infer accuracy on that population
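The accuracy definition above translates directly into code; a minimal sketch comparing predicted labels to true labels:

```python
import numpy as np

def accuracy(predicted, actual):
    """Proportion of examples whose predicted label matches the true label."""
    predicted, actual = np.array(predicted), np.array(actual)
    return np.count_nonzero(predicted == actual) / len(actual)

# 3 of 4 predictions match the true labels
accuracy(['dog', 'wolf', 'dog', 'dog'],
         ['dog', 'wolf', 'wolf', 'dog'])   # 0.75
```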

[Diagram: the sample is labeled and split into a training set and a test set; accuracy is measured on the test set]

(Demo)

16 of 19

Before Classifying

17 of 19

Dog or Wolf?

18 of 19

Start with a Representative Sample

  • Both the training and test sets must accurately represent the population on which you use your classifier

  • Overfitting happens when a classifier does very well on the training set, but can’t do as well on the test set

19 of 19

Standardize if Necessary

Chronic Kidney Disease data set

  • If the attributes are on very different numerical scales, the attributes with larger scales can dominate the distance
  • In such a situation, it is a good idea to convert all the variables to standard units
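Converting to standard units means measuring each value by how many standard deviations it lies above or below the mean of its column. A minimal sketch:

```python
import numpy as np

def standard_units(x):
    """Convert an array of numbers to standard units: (value - mean) / SD."""
    x = np.array(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

# After conversion, every column has mean 0 and SD 1,
# so no attribute dominates the distance just because of its scale.
standard_units([1, 2, 3])
```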

(Demo)