1 of 19

Lecture 36

Classifiers

DATA 8

Spring 2022

2 of 19

Announcements

  • Homework 11 due tomorrow (04/21)
    • Turn in tonight for a bonus point
  • Project 3 Checkpoint due Friday (04/22)
    • Entire project due Friday (04/29)

3 of 19

Classifiers

4 of 19

Nearest Neighbor Classifier

NN Classifier

Use the label of the most similar training example

[Diagram: the attributes of an example go into the classifier, which outputs the predicted label of the example; the population yields a sample, which is labeled and split into a training set and a test set]

5 of 19

The Google Science Fair

  • Brittany Wenger, a 17-year-old high school student in 2012
  • Won by building a breast cancer classifier with 99% accuracy

(Demo)

6 of 19

Distance

7 of 19

Pythagoras’ Formula

The distance between points (x₀, y₀) and (x₁, y₁) is the length of the hypotenuse of a right triangle whose legs have lengths x₀ − x₁ and y₀ − y₁:

D = √((x₀ − x₁)² + (y₀ − y₁)²)

8 of 19

Distance Between Two Points

  • Two attributes x and y:  D = √((x₀ − x₁)² + (y₀ − y₁)²)
  • Three attributes x, y, and z:  D = √((x₀ − x₁)² + (y₀ − y₁)² + (z₀ − z₁)²)
  • and so on ...
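The formulas above generalize to any number of attributes. A minimal sketch of this distance function using numpy (the same idea as the one written in the demo):

```python
import numpy as np

def distance(pt1, pt2):
    """Euclidean distance between two points given as arrays of attributes."""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))

# Two attributes: a 3-4-5 right triangle
distance(np.array([0, 0]), np.array([3, 4]))   # 5.0

# Three attributes: works the same way
distance(np.array([1, 2, 3]), np.array([4, 6, 3]))
```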

(Demo)

9 of 19

Rows

10 of 19

Rows of Tables

Each row contains all the data for one individual

  • t.row(i) evaluates to the ith row of table t
  • t.row(i).item(j) is the value of column j in row i
  • If all values are numbers, then np.array(t.row(i)) evaluates to an array of all the numbers in the row
  • To consider each row individually, use:
    for row in t.rows:
        ... row.item(j) ...
  • t.exclude(i) evaluates to the table t without its ith row
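The same row operations can be mimicked with a plain 2-D numpy array; this hypothetical sketch shows the array analogue of each Table method above (the actual demo uses a datascience Table):

```python
import numpy as np

# A hypothetical 3-row "table" of two numeric attributes.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

row = data[1]                         # analogous to t.row(1)
value = data[1][0]                    # analogous to t.row(1).item(0)
without = np.delete(data, 1, axis=0)  # analogous to t.exclude(1)

# Considering each row individually, like `for row in t.rows`:
for r in data:
    r[0]  # analogous to row.item(0)
```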

11 of 19

Nearest Neighbors

12 of 19

Finding the k Nearest Neighbors

To find the k nearest neighbors of an example:

  • Find the distance between the example and each example in the training set
  • Augment the training data table with a column containing all the distances
  • Sort the augmented table in increasing order of the distances
  • Take the top k rows of the sorted table

13 of 19

The Classifier

To classify a point:

  • Find its k nearest neighbors

  • Take a majority vote of the k nearest neighbors to see which of the two classes appears more often

  • Assign the point to the class that wins the majority vote
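Putting the three steps together, a minimal sketch of the classifier (hypothetical array-based version; the demo builds it from a Table):

```python
import numpy as np

def classify(training, labels, example, k):
    """Predict a label for `example` by majority vote among its k nearest neighbors."""
    # Find the k nearest neighbors of the example
    dists = np.sqrt(np.sum((training - example) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # Majority vote: the class appearing most often among the k neighbors wins
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

training = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
labels = ['benign', 'benign', 'malignant']
classify(training, labels, np.array([0.0, 0.5]), 3)   # 'benign' by a 2-1 vote
```

With two classes, choosing an odd k avoids ties in the vote.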

(Demo)

14 of 19

Evaluation

15 of 19

Accuracy of a Classifier

The accuracy of a classifier on a labeled data set is the proportion of examples that are labeled correctly

Need to compare classifier predictions to true labels

If the labeled data set is sampled at random from a population, then we can infer accuracy on that population
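The accuracy definition above translates directly into code; a minimal sketch comparing predicted labels to true labels:

```python
import numpy as np

def accuracy(predicted, actual):
    """Proportion of examples whose predicted label matches the true label."""
    predicted, actual = np.array(predicted), np.array(actual)
    return np.count_nonzero(predicted == actual) / len(actual)

# 3 of 4 predictions match the true labels
accuracy(['dog', 'wolf', 'dog', 'dog'],
         ['dog', 'wolf', 'wolf', 'dog'])   # 0.75
```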

[Diagram: the sample is labeled and split into a training set and a test set; accuracy is measured on the test set]

(Demo)

16 of 19

Before Classifying

17 of 19

Dog or Wolf?

18 of 19

Start with a Representative Sample

  • Both the training and test sets must accurately represent the population on which you use your classifier

  • Overfitting happens when a classifier does very well on the training set, but can’t do as well on the test set

19 of 19

Standardize if Necessary

Chronic Kidney Disease data set

  • If the attributes are on very different numerical scales, the attributes with larger scales can dominate the distance
  • In such a situation, it is a good idea to convert all the variables to standard units
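Converting to standard units means measuring each value by how many standard deviations it lies above or below the mean of its column. A minimal sketch:

```python
import numpy as np

def standard_units(x):
    """Convert an array of numbers to standard units: (value - mean) / SD."""
    x = np.array(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

# After conversion, every column has mean 0 and SD 1,
# so no attribute dominates the distance just because of its scale.
standard_units([1, 2, 3])
```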

(Demo)