CW 2: KNN and
decision trees
Agenda and recap
Last week, we:
Today we will:
Remember: drop-in hours
They’ll now be Fridays from 11:45 to 1:00, here in B21.
Come talk about the lessons, ask about anything you didn’t understand, or get help starting your own machine learning project!
Peer instruction: what does the code do?
To anaconda!
New data structure: dictionaries
Remember these old paper things?
Well they are actually a useful data structure in programming.
They work like lists, but instead of accessing an element through its index (e.g. list[0]), each element is accessed by a name, called a key. Let’s do an example.
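Here’s a minimal sketch of that idea (the names and grades are made up for illustration):

```python
# A dictionary maps keys (names) to values, so we look things up by name,
# not by position.
grades = {"Ada": 95, "Grace": 88, "Alan": 91}

print(grades["Ada"])         # access by key instead of index -> 95

grades["Katherine"] = 99     # add a new key-value pair
print("Alan" in grades)      # check whether a key exists -> True
```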
On to KNN!
Who got around to doing the KNN POGIL this week?
if POGIL.completed:
    visit("www.computing-workshop.com/knn.html")
else:
    POGIL.do()
KNN - put it in your own words
At your table, create a sentence that summarizes what the KNN algorithm does.
When you’re done write it on the whiteboard!
Algorithmic approach to KNN
Recall, an algorithm is just a set of rules or procedures, like brushing your teeth.
Now that we have a taste of both KNN and algorithms, answer the following questions:
When you’re done, write your answers on the board.
Break!
Decision Trees
To understand decision trees, let’s start by playing a game using decision trees:
Think of the most obscure animal you can to trick the algorithm!
So how does it work?
Try thinking like the algorithm. If you could only ask true or false questions, what questions would you ask to guess which of the animals in the list below someone is thinking about?
animals = ["cat", "parrot", "goldfish", "shark", "unicorn"]
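One possible set of questions, written as nested if statements, is shown below. Each true/false question splits the remaining animals in two, just like an internal node of a decision tree (the particular questions here are just one choice among many):

```python
def guess_animal(is_real, has_fur, can_fly, is_dangerous):
    """Guess which of the five animals someone is thinking of
    using only true/false questions: a tiny decision tree."""
    if not is_real:
        return "unicorn"
    if has_fur:
        return "cat"
    if can_fly:
        return "parrot"
    if is_dangerous:           # separates the two swimmers
        return "shark"
    return "goldfish"

print(guess_animal(is_real=True, has_fur=False,
                   can_fly=False, is_dangerous=True))  # -> shark
```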
New coding concept: libraries
Now, we’re going to talk about coding a decision tree algorithm. Sounds like a lot of work, right?
Well luckily, in coding, there is no need to reinvent the wheel. Once one person has figured out how to do a certain thing with code, their code can be shared so people don’t have to solve the same problem.
This is the concept of a library. A library is a collection of prewritten code like functions that we can import and use as we need.
We’re going to use libraries a lot to implement our ML algorithms, specifically a library called scikit-learn (sklearn for short).
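For example, importing a library takes one line, and then all of its prewritten code is available to us (the `math` example below is just for illustration):

```python
# A library from Python's standard collection: someone already wrote
# square root, so we don't have to.
import math
print(math.sqrt(16))          # -> 4.0

# scikit-learn works the same way: import the piece you need.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)   # a classifier, ready to train
```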
Decision Tree Demo!
We’re going to code a classifier on a new dataset, iris. This classifier will identify which species of iris a new data point belongs to.
This dataset has four features: the length and width of the petal and of the sepal. It has three classes, one for each species of iris (Iris setosa, Iris virginica, and Iris versicolor).
Visit https://computing-workshop.com/courses.html and download the Decision Trees Notebook. Open it in Jupyter.
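As a rough sketch of the kind of code in the notebook (this follows scikit-learn’s standard API; the notebook’s exact cells may differ):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                 # 150 flowers, 4 features each
X, y = iris.data, iris.target      # features and species labels (0, 1, 2)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)                      # the learner builds the tree from the data

# Classify a new flower: sepal length/width, petal length/width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
species = iris.target_names[clf.predict(new_flower)][0]
print(species)                     # -> setosa
```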
How do we make the splits? Gini impurity.
The decision tree learner (the algorithm that constructs the tree from the data) has to decide which feature to split on and where. For example, why is it that the root node asks the question “petal width (cm) <= 0.8”?
The idea is to look at all the possible splits and choose the “best” one. Best in this context means the split that separates the data so the resulting nodes are as “pure” as possible, i.e. it minimizes the expected probability of misclassifying a point at that node. This misclassification measure is called the Gini impurity.
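Gini impurity has a simple formula: 1 minus the sum of the squared class proportions at the node. A small sketch (the example labels are made up):

```python
def gini(labels):
    """Gini impurity: the probability of misclassifying a random point
    at this node if we labeled it randomly according to the class mix."""
    n = len(labels)
    impurity = 1.0
    for label in set(labels):
        p = labels.count(label) / n   # proportion of this class at the node
        impurity -= p ** 2
    return impurity

print(gini(["setosa"] * 10))                     # pure node -> 0.0
print(gini(["setosa"] * 5 + ["virginica"] * 5))  # 50/50 mix -> 0.5
```

A pure node (all one class) has impurity 0, so the learner prefers splits whose children are close to pure.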
Decision trees (DTs)
Pros
Cons
Overfitting
The green and black lines both model the boundary between the two labels (colors) of training points. But the green line is overfit: it follows the boundaries in its training data too closely, so it is likely to have a higher error rate on unseen data.
An overfit DT is usually very deep, with lots of conditions. That’s why we set max_depth=3: it prevents the tree from growing deep enough to overfit.
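One way to see this (a sketch using scikit-learn; iris is small and clean, so the gap between train and test accuracy may be modest here, but on noisier data a depth-unlimited tree falls further behind on unseen points):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for depth in [3, None]:   # None lets the tree grow as deep as it likes
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth,
          clf.score(X_train, y_train),   # accuracy on seen data
          clf.score(X_test, y_test))     # accuracy on unseen data
```

The unlimited tree memorizes the training set perfectly; the shallow tree trades a little training accuracy for simpler, more general rules.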
Recap
Today we:
Next week (and the week after):