Applied Data Analysis (CS401)
Lecture 7
Learning from data:
Supervised learning
1 Nov 2023
Robert West
Announcements
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback
Why ML as part of data science?
Machine learning
Machine learning: examples
Machine learning: techniques
Today
(particularly in light of bias/variance tradeoff)
Lectures 9, 10, 11
Intro to supervised learning:
k nearest neighbors (k-NN)
k nearest neighbors (k-NN)
Given a query item: find the k closest matches in a labeled dataset
k nearest neighbors (k-NN)
Given a query item: find the k closest matches in a labeled dataset, and return the most frequent label among the k
k nearest neighbors (k-NN)
k = 3 votes for “cat”
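A minimal sketch of this procedure (numpy only; the toy points and labels below are hypothetical, and Euclidean distance is assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # distance from the query item to every labeled point (Euclidean, as an example)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest matches
    nearest = np.argsort(dists)[:k]
    # return the most frequent label among the k
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])  # toy data
y_train = np.array(["cat", "cat", "dog", "dog"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> "cat"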
Properties of k-NN
The data is the model
Minimal configuration:
k-NN flavors
Classification: return the most frequent label among the k nearest neighbors*
Regression: return the average of the k nearest neighbors' y values*
* Weighting function is usually the similarity.
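Both flavors are available in scikit-learn; a sketch (X_train, y_labels, y_values are hypothetical placeholders), where weights="distance" corresponds to the similarity weighting noted above:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: (weighted) majority vote among the k neighbors' labels
clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
# Regression: (weighted) average of the k neighbors' y values
reg = KNeighborsRegressor(n_neighbors=5, weights="distance")
# clf.fit(X_train, y_labels); reg.fit(X_train, y_values); then .predict(...)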
k-NN distance (opposite of similarity) measures
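The extracted slide does not list the measures themselves; a few common choices, sketched with numpy (an illustrative selection, not necessarily the slide's own list):

import numpy as np

def euclidean(a, b):        # straight-line (L2) distance
    return np.linalg.norm(a - b)

def manhattan(a, b):        # sum of absolute coordinate differences (L1)
    return np.abs(a - b).sum()

def cosine_distance(a, b):  # 1 - cosine similarity; ignores vector length
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))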
stack.push(kNN)
Predicting from samples
For a dataset consisting of pairs (X, y), we aim to find the true model f with y = f(X).
We train on a training sample D and denote the fitted model as fD.
Bias and variance
Consider a fixed testing point (X, y) not seen during training.
(Figure: predictions fD1(X), fD2(X), … from models fitted on different training samples scatter around the true value y = f(X).)
“Full” bias/variance: average this picture over all testing points (X, y)
Bias/variance tradeoff
Since Error² = Bias² + Variance, there is a tradeoff, usually modulated via model complexity:
Complex models (many parameters) usually have lower bias but higher variance.
Simple models (few parameters) have higher bias but lower variance.
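Spelled out for a fixed test point X with true value y = f(X) (the standard decomposition of the expected squared error over training samples D; the irreducible-noise term is omitted here, as on the slide):

\mathbb{E}_D\big[(f_D(X) - f(X))^2\big]
  = \underbrace{\big(\mathbb{E}_D[f_D(X)] - f(X)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\big[(f_D(X) - \mathbb{E}_D[f_D(X)])^2\big]}_{\text{Variance}}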
Bias/variance tradeoff
Example: A linear model can only fit a straight line. A high-degree polynomial can fit a complex curve. But the polynomial will fit the individual training sample, rather than the full population. Its shape can vary from sample to sample, so it has high variance.
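A small simulation of this effect (hypothetical setup: sin as the "true" function, noise level and degrees chosen arbitrarily), estimating bias and variance at one test point across many resampled training sets:

import numpy as np

rng = np.random.default_rng(0)
f = np.sin          # stand-in "true" function
x_test = 1.5        # fixed test point

preds = {1: [], 10: []}            # polynomial degree -> predictions at x_test
for _ in range(200):               # many different training samples D
    X = rng.uniform(0, 3, size=30)
    y = f(X) + rng.normal(scale=0.3, size=30)
    for deg in preds:
        preds[deg].append(np.polyval(np.polyfit(X, y, deg), x_test))

for deg, p in preds.items():
    bias = np.mean(p) - f(x_test)
    print(f"degree {deg:2d}: bias^2 = {bias**2:.3f}, variance = {np.var(p):.3f}")
# typically: the straight line has the larger bias^2, the degree-10 fit the larger variance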
Bias/variance tradeoff
kNN = stack.pop()
Choosing k for k nearest neighbors
We have a bias/variance tradeoff:
THINK FOR A MINUTE:
When k increases, how do bias and variance change?
(Feel free to discuss with your neighbor.)
Choosing k
(Figure: the true function y = f(X) in the (X, y) plane.)
Choosing k
(Figure: k = 1 fit to points from the original sample; bias = average of the offset from the true y = f(X).)
Choosing k
(Figure: k = 1 fit to points from a different sample; bias = average of the offset from the true y = f(X).)
Choosing k
(Figure: k = 8 fit; bias = average of the offset from the true y = f(X).)
Choosing k
(Figure: k = 8 fit on a different sample; bias = average of the offset from the true y = f(X).)
Choosing k in practice
Use leave-one-out (LOO) cross-validation:
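A sketch with scikit-learn (the dataset is an illustrative stand-in; any labeled (X, y) works): each point is held out once, predicted from all the others, and the k with the best average accuracy wins.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset

scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # each point is predicted from all the others; average the 0/1 accuracies
    scores[k] = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))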
Commercial break
Trick or data!
Data is our candy.
Decision trees
Decision trees: example
Age?
  <=30   → Student?   (yes → yes, no → no)
  31..40 → yes
  >40    → Credit rating?   (fair → yes, excellent → no)
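The same example tree, transcribed directly as code (the numeric age encoding and string values are assumptions for illustration):

def classify(age, student, credit_rating):
    if age <= 30:                                   # branch: Age <= 30
        return "yes" if student == "yes" else "no"  # split on Student
    elif age <= 40:                                 # branch: Age 31..40
        return "yes"
    else:                                           # branch: Age > 40
        return "yes" if credit_rating == "fair" else "no"  # split on Credit rating

print(classify(25, "yes", "fair"))       # -> "yes"
print(classify(45, "no", "excellent"))   # -> "no"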
Decision trees
Model: Flow-chart-like tree structure
Goal: find a decision tree that maximizes classification accuracy on the given dataset
Optimization: finding the globally optimal tree is intractable, so a greedy top-down heuristic is used (next slides)
Decision tree induction
Tree construction (top-down divide-and-conquer strategy)
Partitioning stops if a node is pure (all samples have the same label) or no further split is possible
Decision tree induction
Age?
  <=30   → (to be split further)
  31..40 → yes
  >40    → (to be split further)
Decision tree induction
Age?
  <=30   → Student?   (yes → yes, no → no)
  31..40 → yes
  >40    → (to be split further)
Decision tree induction
Age?
  <=30   → Student?   (yes → yes, no → no)
  31..40 → yes
  >40    → Credit rating?   (fair → yes, excellent → no)
Attribute selection
For a given branch in the tree, the set of samples S to be classified has P positive and N negative samples
The amount of entropy in the set S is
H(P, N) = − P/(P+N) ∙ log2(P/(P+N)) − N/(P+N) ∙ log2(N/(P+N))
Note that H(P, N) = 0 if S is pure (P = 0 or N = 0), and H(P, N) = 1 if P = N.
(Plot: H(P, N) as a function of P/(P+N), maximal at P/(P+N) = 1/2.)
Attribute selection
HS = H(9, 5) = 0.94
Age:     [<=30] H(2, 3) = 0.97,   [31..40] H(4, 0) = 0,   [>40] H(3, 2) = 0.97
Income:  [high] H(2, 2) = 1,      [med] H(4, 2) = 0.92,   [low] H(3, 1) = 0.81
Student: [yes] H(6, 1) = 0.59,    [no] H(3, 4) = 0.98
Rating:  [fair] H(6, 2) = 0.81,   [exc] H(3, 3) = 1
Attribute selection
HS = H(9, 5) = 0.94
HAge = p([<=30]) ∙ H(2, 3) + p([31...40]) ∙ H(4, 0) + p([>40]) ∙ H(3, 2) =
= 5/14 ∙ 0.97 + 4/14 ∙ 0 + 5/14 ∙ 0.97 = 0.69
HIncome = p([high]) ∙ H(2, 2) + p([med]) ∙ H(4, 2) + p([low]) ∙ H(3, 1) =
= 4/14 ∙ 1 + 6/14 ∙ 0.92 + 4/14 ∙ 0.81 = 0.91
HStudent = p([yes]) ∙ H(6, 1) + p([no]) ∙ H(3, 4) = 7/14 ∙ 0.59 + 7/14 ∙ 0.98 = 0.78
HRating = p([fair]) ∙ H(6, 2) + p([exc]) ∙ H(3, 3) = 8/14 ∙ 0.81 + 6/14 ∙ 1 = 0.89
Attribute selection
Attribute A partitions S into S1, S2, ... Sv
Entropy of attribute A is
The information gain obtained by splitting S using A is
Gain(Age) = 0.94 – 0.69 = 0.25
Gain(Income) = 0.94 – 0.91 = 0.03
Gain(Student) = 0.94 – 0.78 = 0.16
Gain(Rating) = 0.94 – 0.89 = 0.05
⇒ split on Age (highest gain)
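These numbers can be reproduced in a few lines (class counts taken from the slides above):

from math import log2

def H(p, n):
    # entropy of a set with p positive and n negative samples
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p/t) * log2(p/t) - (n/t) * log2(n/t)

def attribute_entropy(counts):
    # counts: list of (p_i, n_i) per attribute value; weighted average entropy
    total = sum(p + n for p, n in counts)
    return sum((p + n) / total * H(p, n) for p, n in counts)

H_S = H(9, 5)                                        # 0.94
H_age = attribute_entropy([(2, 3), (4, 0), (3, 2)])  # 0.69
print(round(H_S - H_age, 2))                         # Gain(Age) = 0.25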
Pruning
The construction phase does not filter out noise → overfitting
Many possible pruning strategies
Comments
Decision trees are an example of a classification algorithm
Maybe not the best one ...
Decision tree models
Bias decreases with tree depth
Variance increases with tree depth
Ensemble methods
Are, metaphorically, like “democratic” machine learning algorithms:
Types: bagging (e.g., random forests) and boosting (e.g., boosted decision trees)
Random forests
Grow K trees on datasets sampled from the original dataset (size N) with replacement (bootstrap samples); p = number of features.
At each split, consider only a random subset of m of the p features; typically m might be e.g. sqrt(p), but can be smaller.
Random forests
Principles: we want to take a vote between different learners, so we don't want the models to be too similar. The following two criteria ensure diversity in the individual trees (as sketched below):
1. Each tree is grown on a different bootstrap sample of the data.
2. Each split considers only a random subset of m features.
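A sketch of these two sources of diversity in scikit-learn (parameter values are illustrative; X_train, y_train, X_test are placeholders):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # K trees
    bootstrap=True,       # each tree sees a bootstrap sample of the N rows
    max_features="sqrt",  # each split considers only m ≈ sqrt(p) features
)
# rf.fit(X_train, y_train); rf.predict(X_test)  -- majority vote over the K trees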
Random forests
Boosted decision trees
Random forests vs. boosted trees
Random forests: trees of depth 10-30, 10's of trees, 100k's of nodes.
Bias reduction through depth; variance reduction through the ensemble aggregate.
Random forests vs. boosted trees
Boosted trees: trees of depth 4-8, 1000's of trees, 100's of nodes.
Bias reduction through boosting – variance already low.
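A corresponding boosted-trees sketch in scikit-learn, reflecting the shallow-but-many pattern above (settings are illustrative; X_train, y_train, X_test are placeholders):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,   # many small trees, each one correcting its predecessors
    max_depth=4,         # shallow trees: variance stays low, boosting reduces bias
    learning_rate=0.05,  # shrinks each tree's contribution
)
# gb.fit(X_train, y_train); gb.predict(X_test)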
Random forests vs. boosted trees
Random forests: depth 10-30, 10's of trees, 100k's of nodes.
Boosted trees: depth 4-8, 1000's of trees, 100's of nodes.
For your personal perusal: "A visual introduction to machine learning"
Linear and logistic regression
Linear regression
How to model binary events?
Logistic regression
(Figure: for binary y, a straight-line fit is bad; we want an S-shaped curve instead.)
Logistic regression
The "sigmoid" (logistic) function: σ(z) = 1 / (1 + e^(−z))
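A sketch of the sigmoid and of fitting a logistic regression with scikit-learn (the 1-D toy data are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # squashes any real number into (0, 1), interpreted as P(y = 1 | X)
    return 1 / (1 + np.exp(-z))

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])                          # toy binary outcomes

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]])[:, 1])  # estimated P(y = 1 | X = 2.0)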
Overfitting
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback
Criteria
Predictive performance (accuracy, AUC/ROC, precision, recall, F1-score, etc.)
Speed and scalability
Robustness
Interpretability
k-NN and the curse of dimensionality
The curse of dimensionality refers to "weird" phenomena that occur in high-dimensional spaces (100s to millions of dimensions) but not in low-dimensional (e.g., 3-dimensional) space.
In particular, data in high dimensions are much sparser (less dense) than data in low dimensions.
For k-NN, this means there are fewer points that are very close in feature space (i.e., very similar) to the point X whose y we want to predict.
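A small numpy experiment illustrating this sparsity (uniform random points in the unit cube; an idealized setup, not real data): as the dimension grows, the nearest point to a query is barely closer than an average point.

import numpy as np

rng = np.random.default_rng(0)
n = 1000                          # number of labeled points

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))  # points in [0, 1]^d
    q = rng.uniform(size=d)       # query point
    dists = np.linalg.norm(X - q, axis=1)
    print(d, round(dists.min() / dists.mean(), 2))
# the ratio min/mean approaches 1 as d grows: the "nearest" neighbor is hardly nearer than average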
k-NN and the curse of dimensionality
From this perspective, it is surprising that k-NN works at all in high dimensions.
Luckily, real data are not like random points in a high-dimensional cube. Instead, they live in dense clusters and near much lower-dimensional surfaces.
Also, points can be very "similar" even if their Euclidean distance is large. E.g., documents sharing the same few dominant words are likely to be on the same topic (→ use a different distance measure).
k-NN and the curse of dimensionality