1 of 18

CW 2: KNN and decision trees

2 of 18

Agenda and recap

Last week, we:

  • Coded, a lot

Today we will:

  • Spend the first hour on KNN
  • Spend the second hour on decision tree classifiers

3 of 18

Remember: drop-in hours

They’ll now be Fridays from 11:45 to 1:00, here at B21.

Come talk about the lessons or anything you didn’t understand, or let us help you start a project using machine learning!

4 of 18

Peer instruction: what does the code do?

To Anaconda!

5 of 18

New data structure: dictionaries

Remember these old paper things?

Well, they are actually a useful data structure in programming.

They work like lists, but instead of looking up an element by its position in the list (e.g. list[0]), each element is looked up by a name, called a key. Let’s do an example.
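A quick sketch (the variable names here are made up just for illustration):

temperatures = [12.5, 14.0, 9.8]                  # a list: look up by position
print(temperatures[0])                            # 12.5

capitals = {"France": "Paris", "Japan": "Tokyo"}  # a dictionary: look up by name (key)
print(capitals["Japan"])                          # Tokyo

capitals["Peru"] = "Lima"                         # adding an entry is just assigning to a new key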

6 of 18

On to KNN!

Who got around to doing the KNN POGIL this week?

if POGIL.completed:
    visit("www.computing-workshop.com/knn.html")
else:
    POGIL.do()

7 of 18

KNN - put it in your own words

At your table, create a sentence that summarizes what the KNN algorithm does.

When you’re done, write it on the whiteboard!

8 of 18

Algorithmic approach to KNN

Recall that an algorithm is just a set of rules or steps to follow, like the routine you use to brush your teeth.

Now that we have a taste of both KNN and algorithms, answer the following questions:

  • What is the relevant data our KNN algorithm needs to work with?
  • What are the steps the algorithm has to take to classify a new point?

When you’re done, write your answers on the board.
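Once your answers are on the board, you can compare them against this minimal sketch of the classification step in Python. The data format and function names here are our own assumptions, not from a library:

from collections import Counter
import math

def knn_classify(new_point, labeled_points, k=3):
    # labeled_points is a list of (features, label) pairs,
    # where features is a list of numbers, e.g. [height, weight]
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Step 1: sort every known point by its distance to the new point
    nearest = sorted(labeled_points, key=lambda p: distance(p[0], new_point))

    # Step 2: take the k closest points and collect their labels
    top_k_labels = [label for _, label in nearest[:k]]

    # Step 3: the most common label among the neighbours wins
    return Counter(top_k_labels).most_common(1)[0][0]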

9 of 18

Break!

10 of 18

Decision Trees

To understand decision trees, let’s start by playing a game using decision trees:

http://animalgame.com/

Think of the most obscure animal you can to trick the algorithm!

11 of 18

So how does it work?

Try thinking like the algorithm. If you could only ask true or false questions, what questions would you ask to guess which of the animals in the list below someone is thinking about?

animals = ["cat", "parrot", "goldfish", "shark", "unicorn"]
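One possible question tree for that list, written as nested if/else statements. The questions are just one choice among many, and ask is a stand-in for however you’d ask the player:

def guess_animal(ask):
    # ask("question?") should return True or False
    if ask("Does it live in water?"):
        if ask("Is it dangerous?"):
            return "shark"
        return "goldfish"
    if ask("Can it fly?"):
        return "parrot"
    if ask("Is it a real animal?"):
        return "cat"
    return "unicorn"

# Example: guess_animal(lambda question: input(question + " (y/n) ") == "y")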

12 of 18

New coding concept: libraries

Now, we’re going to talk about coding a decision tree algorithm. Sounds like a lot of work, right?

Well luckily, in coding, there is no need to reinvent the wheel. Once one person has figured out how to do a certain thing with code, their code can be shared so that other people don’t have to solve the same problem all over again.

This is the concept of a library. A library is a collection of prewritten code, such as functions, that we can import and use as needed.

We’re going to use libraries a lot to implement our ML algorithms, specifically a library called scikit-learn (sklearn for short).
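For example, importing the decision tree learner from scikit-learn takes a single line (assuming the library is installed, e.g. via Anaconda):

# Import just the piece of scikit-learn we need
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()   # a classifier object, ready to be trained on data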

13 of 18

Decision Tree Demo!

We’re going to code a classifier on a new dataset, iris. This classifier will identify which species of iris a new data point belongs to.

This data set has four features: the length and width of the petal and of the sepal. It has three classes, one for each species of iris (Iris setosa, Iris virginica, and Iris versicolor).

Visit https://computing-workshop.com/courses.html and download the Decision Trees Notebook. Open it in Jupyter.
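If you’d like a preview, here is a minimal sketch of the kind of code the notebook walks through; the exact cells in the notebook may differ:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                        # 150 flowers, 4 measurements each
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(iris.data, iris.target)           # learn the tree from the data

# Classify a new flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
print(iris.target_names[clf.predict(new_flower)])   # prints ['setosa']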

14 of 18

How do we make the splits? Gini impurity.

The decision tree learner (the algorithm that constructs the tree from the data) has to decide which feature to split on and at what threshold. For example, why does the root node ask the question "petal width (cm) <= 0.8"?

The idea is to look at all the possible splits and choose the “best” one. Best in this context means that the split separates the data in a way that minimizes the expected misclassification probability at that node. This misclassification error rate is called the Gini impurity.
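Concretely, the Gini impurity of a node is 1 minus the sum of the squared class proportions. Here is a small sketch of how you could compute it yourself (this helper function is ours, not part of scikit-learn):

def gini_impurity(labels):
    # labels is a list of class labels, e.g. ["setosa", "setosa", "versicolor", ...]
    total = len(labels)
    proportions = [labels.count(c) / total for c in set(labels)]
    # Chance of misclassifying a point drawn from this node if we label it
    # at random according to the node's class proportions
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity(["setosa"] * 50))                        # 0.0  (perfectly pure node)
print(gini_impurity(["setosa"] * 25 + ["versicolor"] * 25))  # 0.5  (evenly mixed node)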

15 of 18

16 of 18

Decision trees (DTs)

Pros

  • Easy to understand and visualize.
  • “White box” - results can be explained.
  • Handles numerical and categorical data.
  • Using a tree is fast.

Cons

  • Susceptible to overfitting.
  • Finding an optimal DT is computationally intractable, so practical algorithms settle for good-but-suboptimal DTs.
  • DT learners can create biased trees if some labels dominate.

17 of 18

Overfitting

The green and black lines both model the boundary between the two labels (colors) of training points. But the green line is overfit: it follows the quirks of its training data too closely, so it is likely to have a higher error rate on unseen data.

An overfit DT is usually very deep, with lots of conditions. That’s why we used max_depth=3 in the demo: limiting the depth helps prevent overfitting.
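One way to see overfitting for yourself (not part of the demo notebook, just a sketch): hold out some of the iris data and compare accuracy on the training data with accuracy on the held-out data as the tree gets deeper.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

for depth in [1, 3, None]:   # None lets the tree grow as deep as it likes
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test))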

18 of 18

Recap

Today we:

  • Learned a lot about Python in a very short period of time
  • Created a decision tree classifier for flowers

Next week (and the week after):

  • We’re moving on to neural nets, a machine learning algorithm loosely inspired by the brain!