Applied Data Analysis (CS401)
Robert West
Lecture 8
Supervised learning
2018/11/08
Announcements
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec8-feedback
Machine Learning
Machine Learning: Examples
Machine Learning: Techniques
Criteria
Predictive performance (accuracy, AUC/ROC, precision, recall, F1-score, etc.; see the sketch after this list)
Speed and scalability
Robustness
Interpretability
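To make the performance metrics above concrete, here is a minimal scikit-learn sketch (scikit-learn is assumed to be available; the toy labels and scores are made up for illustration):

```python
# A hedged sketch of computing the listed performance metrics with
# scikit-learn; y_true / y_score below are made-up toy values.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.7, 0.6]   # predicted P(y = 1)
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]     # threshold at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC/ROC  :", roc_auc_score(y_true, y_score))   # uses scores, not hard labels
```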
Intro to supervised learning:
k nearest neighbors (kNN)
k Nearest Neighbors
Given a query item: find the k closest matches in a labeled dataset
k Nearest Neighbors
Given a query item: find the k closest matches and return the most frequent label
k-Nearest Neighbors
k = 3 votes for “cat”
k-NN issues
The data is the model
Minimal configuration: only k (and the distance measure) needs to be chosen.
k-NN Flavors
Classification: return the most frequent (possibly weighted*) label among the k nearest neighbors.
Regression: return the (weighted*) average of the values of the k nearest neighbors (see the sketch below).
* Weight function is usually the inverse distance.
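To make the two flavors concrete, here is a minimal scikit-learn sketch (scikit-learn is assumed to be available; the tiny arrays are made up for illustration):

```python
# A minimal sketch of both k-NN flavors using scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])          # features
y_class = np.array(["cat", "cat", "dog", "dog", "dog"])    # labels
y_reg = np.array([0.1, 0.9, 2.1, 2.9, 4.2])                # numeric targets

# Classification: majority vote among the k nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y_class)
print(clf.predict([[1.4]]))          # -> ['cat']

# Regression: inverse-distance-weighted average of the neighbors' values.
reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
reg.fit(X, y_reg)
print(reg.predict([[1.4]]))
```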
k-NN distance measures
k-NN metrics
stack.push(kNN)
Predicting from Samples
For datasets consisting of (X, y) pairs
Assume a true model f: y = f(X) + ε (noise)
We train on a training sample D and denote the resulting model by fD(X)
Bias and Variance
Bias and Variance Tradeoff
There is usually a bias-variance tradeoff caused by model complexity.
Complex models (many parameters) usually have lower bias, but higher variance.
Simple models (few parameters) have higher bias, but lower variance.
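For reference, the decomposition behind this tradeoff, written out with the notation from the previous slide (f is the true model, fD the model trained on sample D; the noise term with variance σ² is an assumption of this formulation):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - f_D(x))^2\right]
  = \underbrace{\left(\mathbb{E}_D[f_D(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(f_D(x) - \mathbb{E}_D[f_D(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```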
Bias and Variance Tradeoff
E.g., a linear model can only fit a straight line, while a high-degree polynomial can fit a complex curve. But the polynomial may end up fitting the individual sample rather than the population: its shape can vary from sample to sample, so it has high variance.
Bias and Variance Tradeoff
kNN = stack.pop()
Choosing k for k nearest neighbors
We have a bias/variance tradeoff: small k → low bias but high variance; large k → higher bias but lower variance.
Choosing k
[Figure series: the true curve y = f(X), with kNN fits to points from the original sample and from a different sample. Bias = average offset of the fit from f(X). With k = 1 the fit follows each sample's points closely: low bias, but the fit changes a lot from sample to sample (high variance). With k = 8 the fit is smoother and stable across samples (low variance), but its average offset from f(X) is larger (higher bias).]
Choosing k in practice
Use leave-one-out cross-validation (LOO): break the data into train and test subsets, e.g. an 80-20% random split, and do the tuning below on the training set.
Predict: for each point in the training set, predict using the k nearest neighbors among all other points in the training set. Measure the error rate (classification) or squared error (regression).
Tune: try different values of k and use the one that gives the minimum leave-one-out error.
Evaluate: test on the test set to measure the final performance (a sketch of the whole procedure follows below).
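A sketch of this procedure with scikit-learn (assumed installed); the built-in iris dataset just stands in for whatever (X, y) data you have:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 80-20 random split into train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune: leave-one-out accuracy on the training set for several values of k.
loo_scores = {}
for k in [1, 3, 5, 7, 9, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_tr, y_tr, cv=LeaveOneOut())
    loo_scores[k] = scores.mean()
best_k = max(loo_scores, key=loo_scores.get)

# Evaluate: train with the chosen k and score on the held-out test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(best_k, final.score(X_te, y_te))
```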
kNN and the curse of dimensionality
The curse of dimensionality refers to phenomena that occur in high dimensions (100s to millions) that do not occur in low-dimensional (e.g. 3-dimensional) space.
In particular, data in high dimensions are much sparser (less dense) than data in low dimensions.
For kNN, that means there are fewer points that are very close in feature space (i.e., very similar) to the point we want to predict.
kNN and the curse of dimensionality
kNN and the curse of dimensionality
From this perspective, it’s surprising that kNN works at all in high dimensions.
Luckily real data are not like random points in a high-dimensional cube. Instead they live in dense clusters and near much lower-dimensional surfaces.
Also, points can be very “similar” even if their Euclidean distance is large. E.g., documents sharing the same few dominant words are likely to be on the same topic (→ use a different distance measure, e.g. cosine distance).
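A quick numerical illustration of the sparsity/concentration effect (a NumPy sketch on random points in the unit cube, not taken from the slides):

```python
# Sketch: in high dimensions, the nearest and farthest random neighbors are
# almost equally far away, so "nearest" becomes less meaningful.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))          # 1000 random points in the unit cube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:5d}  nearest/farthest distance ratio: "
          f"{dists.min() / dists.max():.2f}")
# The ratio approaches 1 as d grows: everything is roughly equally far away.
```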
Decision Trees
Decision trees: Example
Age?
  <=30 → Student? (no → no, yes → yes)
  31..40 → yes
  >40 → Credit rating? (fair → yes, excellent → no)
Decision trees
Model: flow-chart-like tree structure
Score function: classification accuracy
Optimization: greedy top-down tree construction (next slides)
Decision tree induction
Tree construction (top-down divide-and-conquer strategy)
Partitioning stops if, e.g., all samples in a node belong to the same class, no attributes are left to split on, or no samples remain.
Decision tree induction
Age?
  <=30 → ?
  31..40 → yes
  >40 → ?
Decision tree induction
Age?
  <=30 → Student? (no → no, yes → yes)
  31..40 → yes
  >40 → ?
Decision tree induction
Age?
  <=30 → Student? (no → no, yes → yes)
  31..40 → yes
  >40 → Credit rating? (fair → yes, excellent → no)
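In practice one rarely induces such a tree by hand; here is a hedged scikit-learn sketch in the same spirit (the tiny table below is invented for illustration and is not the dataset used on the slides):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "age":     ["<=30", "<=30", "31..40", ">40", ">40", "31..40"],
    "student": ["no",   "yes",  "no",     "no",  "yes", "yes"],
    "rating":  ["fair", "fair", "excellent", "fair", "excellent", "fair"],
    "buys":    ["no",   "yes",  "yes",    "yes", "no",  "yes"],
})

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(df[["age", "student", "rating"]])
y = df["buys"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```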
Attribute selection
At a given branch in the tree, the set of samples S to be classified has P positive and N negative samples
The amount of entropy in the set S is H_S = H(P, N) = −(P/(P+N)) ∙ log2(P/(P+N)) − (N/(P+N)) ∙ log2(N/(P+N))
Note that H(P, N) = 0 if all samples are in one class (P = 0 or N = 0), and H(P, N) = 1 (maximum) if P = N.
Attribute selection
H_S = H(9, 5) = 0.94
Age [<=30]: H(2, 3) = 0.97      Income [high]: H(2, 2) = 1
Age [31..40]: H(4, 0) = 0       Income [med]: H(4, 2) = 0.92
Age [>40]: H(3, 2) = 0.97       Income [low]: H(3, 1) = 0.81
Student [yes]: H(6, 1) = 0.59   Rating [fair]: H(6, 2) = 0.81
Student [no]: H(3, 4) = 0.98    Rating [exc]: H(3, 3) = 1
Attribute selection
H_S = H(9, 5) = 0.94
H_Age = p([<=30]) ∙ H(2, 3) + p([31..40]) ∙ H(4, 0) + p([>40]) ∙ H(3, 2) =
  = 5/14 ∙ 0.97 + 4/14 ∙ 0 + 5/14 ∙ 0.97 = 0.69
H_Income = p([high]) ∙ H(2, 2) + p([med]) ∙ H(4, 2) + p([low]) ∙ H(3, 1) =
  = 4/14 ∙ 1 + 6/14 ∙ 0.92 + 4/14 ∙ 0.81 = 0.91
H_Student = p([yes]) ∙ H(6, 1) + p([no]) ∙ H(3, 4) = 7/14 ∙ 0.59 + 7/14 ∙ 0.98 = 0.78
H_Rating = p([fair]) ∙ H(6, 2) + p([exc]) ∙ H(3, 3) = 8/14 ∙ 0.81 + 6/14 ∙ 1 = 0.89
Attribute selection
Attribute A partitions S into S_1, S_2, ..., S_v
Entropy of attribute A is H_A = Σ_{i=1..v} (|S_i| / |S|) ∙ H(S_i)
The information gain obtained by splitting S using A is Gain(A) = H_S − H_A
Gain(Age) = 0.94 – 0.69 = 0.25   ← split on Age (largest gain)
Gain(Income) = 0.94 – 0.91 = 0.03
Gain(Student) = 0.94 – 0.78 = 0.16
Gain(Rating) = 0.94 – 0.89 = 0.05
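These numbers can be reproduced directly from the (P, N) counts above with a few lines of plain Python (a sketch, not code from the lecture; no external libraries needed):

```python
import math

def entropy(p, n):
    """H(p, n): entropy (in bits) of a set with p positive and n negative samples."""
    total = p + n
    return -sum(c / total * math.log2(c / total) for c in (p, n) if c > 0)

def gain(parent, partitions):
    """Information gain of splitting a set with parent=(P, N) counts
    into subsets with the given list of (p_i, n_i) counts."""
    total = sum(parent)
    h_attr = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*parent) - h_attr

print(round(entropy(9, 5), 2))                              # 0.94
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 2))     # Age:    0.25
print(round(gain((9, 5), [(2, 2), (4, 2), (3, 1)]), 2))     # Income: 0.03
```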
Pruning
The construction phase does not filter out noise → overfitting
Many possible pruning strategies
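One concrete strategy (a hedged sketch, not necessarily the one meant on the slide) is cost-complexity pruning as implemented in scikit-learn; the dataset is only a stand-in:

```python
# Sketch: grow a tree and prune it back via cost-complexity pruning
# (ccp_alpha); a larger alpha means more aggressive pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for alpha in [0.0, 0.005, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    print(alpha, cross_val_score(tree, X, y, cv=5).mean())
```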
Comments
Decision trees are just one example of a classification algorithm
Maybe not the best one ...
Decision Tree Models
Bias decreases with tree depth
Variance increases with tree depth
Ensemble Methods
Are like crowdsourced machine learning algorithms: many individually simple learners vote on (or average) the prediction.
Types: bagging (e.g. random forests) and boosting (e.g. boosted decision trees).
Random Forests
Grow K trees on datasets sampled from the original dataset (size N) with replacement (bootstrap samples); p = number of features.
At each node, consider only a random subset of m of the p features when choosing the split.
Typically m might be e.g. sqrt(p), but it can be smaller.
Random Forests
Principles: we want to take a vote between different learners, so we don't want the models to be too similar. These two criteria (bootstrap samples and random feature subsets) ensure diversity in the individual trees; a sketch follows below.
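A minimal sketch with scikit-learn's RandomForestClassifier (assumed installed); n_estimators and max_features play the roles of K and m above, and the built-in breast-cancer dataset is only a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# K = 100 trees; each split considers only m = sqrt(p) randomly chosen features,
# and each tree is grown on a bootstrap sample of the N training points.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```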
Random Forests
Boosted Decision Trees
Random Forests vs. Boosted Trees
Random forests: tree depth 20-30, 10's of trees, 100k's of nodes.
Bias reduction through depth; variance reduction through the ensemble aggregate.
Random Forests vs Boosted Trees
Boosted trees: tree depth 4-8, 1000's of trees, 100's of nodes.
Bias reduction through boosting; variance is already low (shallow trees).
Random Forests vs Boosted Trees
Random forests: depth 20-30, 10's of trees, 100k's of nodes.
Boosted trees: depth 4-8, 1000's of trees, 100's of nodes.
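For comparison, a hedged sketch of a boosted ensemble of shallow trees with scikit-learn's GradientBoostingClassifier (same stand-in dataset; the hyperparameter values are illustrative, not the lecture's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Many shallow trees (depth 3 here), each one fit to the residual errors of
# the ensemble built so far: boosting reduces bias step by step.
gb = GradientBoostingClassifier(n_estimators=500, max_depth=3,
                                learning_rate=0.05, random_state=0)
print(cross_val_score(gb, X, y, cv=5).mean())
```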
On model transparency
DNNs are often considered “opaque” boxes
Do you think it’s easier to interpret a model with 1000s of trees?!
Linear and logistic regression
Linear Regression
We want to find the “best” line (linear function y=f(X)) to explain the data.
[Plot: data points in the (X, y) plane with a fitted line]
Linear Regression
Least Squares Solution
Choose the line that minimizes the sum of squared residuals Σ_i (y_i – f(x_i))²; in matrix form, for y ≈ Xβ, the solution is β̂ = (XᵀX)⁻¹ Xᵀ y.
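A minimal NumPy sketch of the least-squares fit (the data is synthetic, invented for illustration):

```python
# y = 2x + 1 plus noise; recover intercept and slope by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

# Design matrix with an intercept column; lstsq solves min ||y - X b||^2.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)          # approximately [1.0, 2.0] = (intercept, slope)
```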
How to model binary events?
Logistic regression
[Plot: a straight-line fit to 0/1 outcomes is a bad model (“Bad!”); we want an S-shaped curve that stays between 0 and 1 (“Want this!”)]
Logistic regression
“sigmoid”: σ(z) = 1 / (1 + e^(−z)); logistic regression models P(y = 1 | X) = σ(β₀ + β₁∙X).
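A hedged scikit-learn sketch of logistic regression on synthetic binary data (invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # binary outcome

model = LogisticRegression().fit(X, y)
# predict_proba returns P(y=0) and P(y=1), obtained via the sigmoid of the linear score.
print(model.predict_proba([[0.0], [2.0]]))
```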
Overfitting
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec8-feedback
Classification
[Diagram: a training set with categorical attributes, a numerical attribute, and a class label is fed to a classification algorithm, which produces a classifier (model), e.g.
IF rank = “professor” AND years > 6 THEN tenured = “yes”]
Linear Regression
R2-values and P-values
We can always fit a linear model to any dataset, but how do we know if there is a real linear relationship?
R2-values
R-squared
[Plot: data points, the fitted line, and the residuals (vertical offsets), which are small if the fit is good]
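The standard definition behind this picture (not spelled out in the extracted slide text):

```latex
R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\qquad \text{(residual sum of squares, small if the fit is good, over total sum of squares)}
```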
R2-values and P-values
Statistic: From R-squared we can derive another statistic (using degrees of freedom) that has a standard distribution called an F-distribution.
From the CDF for the F-distribution, we can derive a P-value for the data.
The P-value is, as usual, the probability of observing the data under the null hypothesis of no linear relationship.
If p is small, say less than 0.05, we reject the null hypothesis and conclude that there is a linear relationship.
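A hedged sketch of obtaining R-squared, the F-statistic, and its p-value with statsmodels (assumed installed; the data is synthetic):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=100)

X = sm.add_constant(x)                   # add an intercept column
results = sm.OLS(y, X).fit()

print(results.rsquared)                  # R-squared
print(results.fvalue, results.f_pvalue)  # F-statistic and its p-value
```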
Example of correlation (and of the impact of each feature) explained via regression