1 of 67


Applied Data Analysis (CS401)

Lecture 7

Learning from data:

Supervised learning

1 Nov 2023

Robert West

2 of 67

Annowwwwwncements

  • Happy (belated) Halloween!
  • Homework H1 is being graded. Feedback to be released next week.
  • Project milestone P2 due on Fri 17 Nov
  • Friday’s lab session: two parallel tracks:
    • Quiz 6
    • Track 1: exercise on supervised learning (in BCH 2201)
    • Track 2: project office hours (on Zoom)
      • Logistics: see Ed post
      • Do come and ask for feedback – everyone will win!

3 of 67

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Pumpkin Pete?

4 of 67

Why ML as part of data science?


  • ML can facilitate most steps of the data analysis cycle.
  • But stay critical: “Can I trust my ML model?”

5 of 67

Machine learning

  • Supervised: We are given input/output pairs (X, y) (a.k.a. “samples”) that are related via a function y = f(X). We would like to “learn” f, and evaluate it on new data. Types:
    • Discrete y (class labels): “classification”
    • Continuous y: “regression” (e.g., linear regression)
  • Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is a “simpler” representation.
    • Discrete y (cluster labels): “clustering”
    • Continuous y: “dimensionality reduction”

6 of 67

Machine learning: examples

  • Supervised (lecture 7, i.e., today):
    • Is this image a cat, dog, or cow?
    • How would this user rate that restaurant?
    • Is this email spam?
    • Is this blob on a telescope image a supernova?
  • Unsupervised (lecture 9):
    • Cluster handwritten digit data into 10 classes
    • What are the top 20 topics in Twitter right now?
    • Find the best 2D visualization of 1000-dimensional data

7 of 67

Machine learning: techniques

  • Supervised learning:
    • k-NN (k nearest neighbors)
    • Tree-based models: decision trees, random forests
    • Linear + logistic regression
    • Naïve Bayes
    • Support vector machines
    • Supervised neural networks
    • etc.
  • Unsupervised learning:
    • Clustering
    • Dimensionality reduction: topic modeling, matrix factorization (PCA, SVD, word2vec)
    • Hidden Markov models (HMM)
    • etc.

Today: supervised learning (particularly in light of the bias/variance tradeoff)

Lectures 9, 10, 11: unsupervised learning

8 of 67

Intro to supervised learning:

k nearest neighbors (k-NN)

9 of 67

k nearest neighbors (k-NN)

Given a query item: find the k closest matches in a labeled dataset ↓

10 of 67

k nearest neighbors (k-NN)

Given a query item: find the k closest matches in a labeled dataset ↓, then return the most frequent label among the k

11 of 67

k nearest neighbors (k-NN)

k = 3 votes for “cat”

12 of 67

Properties of k-NN

The data is the model

  • No training needed.
  • Conceptually simple algorithm.
  • Accuracy generally improves with more data.
  • Usually need data in memory, but can also be run from disk.

Minimal configuration:

  • Only one parameter: k (number of neighbors)
  • But two other choices are also important:
    • Similarity metric
    • Weighting of neighbors in voting (e.g. by similarity)

13 of 67

k-NN flavors

Classification:

  • Model is y = f(X), y is from a discrete set (labels).
  • Given X, compute y = majority vote of the k nearest neighbors.
  • Can also use a weighted vote* of the neighbors.

Regression:

  • Model is y = f(X), y is a real value.
  • Given X, compute y = average value of the k nearest neighbors.
  • Can also use a weighted average* of the neighbors.

* Weighting function is usually the similarity.
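
For reference, a minimal scikit-learn sketch of both flavors (illustrative variable names and toy data; weights="distance" gives the similarity-weighted vote/average mentioned above):

# Minimal sketch of k-NN classification and regression with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                          # 100 samples, 2 features
y_class = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)    # discrete labels
y_reg = 2 * X_train[:, 0] + X_train[:, 1]                    # continuous target
X_query = rng.normal(size=(5, 2))

# Classification: (distance-weighted) majority vote of the k nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance")
clf.fit(X_train, y_class)
print(clf.predict(X_query))

# Regression: (distance-weighted) average of the k nearest neighbors' values.
reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
reg.fit(X_train, y_reg)
print(reg.predict(X_query))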

14 of 67

k-NN distance (opposite of similarity) measures
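
For reference, a minimal sketch of three distance measures commonly used with k-NN (Euclidean, Manhattan, and cosine distance; a and b are NumPy feature vectors):

# Common k-NN distance measures (a sketch).
import numpy as np

def euclidean(a, b):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; small when the vectors point in similar directions
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))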

 

15 of 67

k-NN distance (opposite of similarity) measures

16 of 67

stack.push(kNN)


17 of 67

Predicting from samples

  • Most datasets are samples from a (maybe infinite) population.
  • We are most interested in models of the population, but we only have access to a sample (blue points) from the population.

For a dataset consisting of pairs (X, y):

  • features X, label y,

we aim to find the true model f:

  • y = f(X).

We train on a training sample D and denote the fitted model as fD.

18 of 67

Bias and variance

  • Given a random training sample D, obtain model fD
  • For a new data point (X, y), prediction is fD(X)
  • (Squared) error = E[(fD(X) – y)²] (E is expectation over D!)
  • Fact: error can be decomposed into two parts (derivation)
    • Error² = Bias² + Variance
    • Bias = E[fD(X) – y]
    • Variance = E[(fD(X) – E[fD(X)])²]
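
For reference, a short derivation (for a fixed test point (X, y), with the expectation taken over the training sample D; adding and subtracting the mean prediction makes the cross term vanish):

\begin{aligned}
\mathrm{Error}^2 &= \mathbb{E}_D\big[(f_D(X) - y)^2\big] \\
&= \mathbb{E}_D\big[\big(f_D(X) - \mathbb{E}_D[f_D(X)] + \mathbb{E}_D[f_D(X)] - y\big)^2\big] \\
&= \underbrace{\mathbb{E}_D\big[\big(f_D(X) - \mathbb{E}_D[f_D(X)]\big)^2\big]}_{\text{Variance}}
 + \underbrace{\big(\mathbb{E}_D[f_D(X)] - y\big)^2}_{\text{Bias}^2}
\end{aligned}
% The cross term vanishes because \mathbb{E}_D\big[f_D(X) - \mathbb{E}_D[f_D(X)]\big] = 0.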

19 of 67

Bias and variance

 

20 of 67

Consider a fixed testing point (X, y), not seen during training.

[Illustration: true value y = f(X) vs. predictions fD1(X), fD2(X) from models trained on different samples]

“Full” bias/variance: average this picture over all testing points (X, y)

21 of 67

Bias/variance tradeoff

Since Error² = Bias² + Variance, there is a tradeoff, usually modulated via model complexity:

Complex models (many parameters) usually have lower bias, but higher variance.

Simple models (few parameters) have higher bias, but lower variance.

22 of 67

Bias/variance tradeoff

Example: A linear model can only fit a straight line. A high-degree polynomial can fit a complex curve. But the polynomial will fit the individual training sample, rather than the full population. Its shape can vary from sample to sample, so it has high variance.


23 of 67

Bias/variance tradeoff

 

24 of 67

kNN = stack.pop()


25 of 67

Choosing k for k nearest neighbors

We have a bias/variance tradeoff:

  • Small k → ?
  • Large k → ?

THINK FOR A MINUTE:

When k increases,�how do bias and variance change?

(Feel free to discuss with your neighbor.)

26 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance
  • Assume the real data follows the blue curve, with zero-mean additive noise. Red points are a data sample.

[Plot: true curve y = f(X) (blue) with sampled points (red)]

27 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 1 fit to points from the original sample; bias = average offset from the true curve y = f(X)]

28 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 1 fit to points from a different sample; bias = average offset from the true curve y = f(X)]

29 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 8 fit; bias = average offset from the true curve y = f(X)]

30 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 8 fit; bias = average offset from the true curve y = f(X)]

31 of 67

Choosing k in practice

Use leave-one-out (LOO) cross-validation:

  • Split: Break data into train and test subsets, e.g. an 80/20% random split.
  • Predict: For each point in the training set, predict using the k nearest neighbors from the set of all other points in the training set. Measure the LOO error rate (classification) or squared error (regression).
  • Tune: Try different values of k, and use the one that gives minimum leave-one-out error.
  • Evaluate: Measure error on the test set to quantify performance.
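
A minimal scikit-learn sketch of this recipe (illustrative synthetic data; cross_val_score with LeaveOneOut computes the LOO error on the training set):

# Choosing k for k-NN via leave-one-out cross-validation (a sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Split: e.g., 80/20% random split into train and test.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Predict + Tune: LOO misclassification rate on the training set for several k.
loo_error = {}
for k in [1, 3, 5, 7, 9, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=LeaveOneOut())
    loo_error[k] = 1 - scores.mean()
best_k = min(loo_error, key=loo_error.get)

# Evaluate: error on the held-out test set with the chosen k.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(best_k, 1 - final.score(X_te, y_te))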

32 of 67

Commercial break


Trick or data!

Data is our candy.

33 of 67

Decision trees

34 of 67

Decision trees: example

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → Credit rating?
    fair → yes
    excellent → no

35 of 67

Decision trees

Model: Flow-chart-like tree structure

    • Nodes are tests on a single attribute
    • Branches are attribute values of parent node
    • Leaves are marked with class labels

Goal: Find decision tree that maximizes classification accuracy on given dataset

Optimization:

  • NP-hard
  • Heuristic: greedy top-down tree construction + pruning

36 of 67

Decision tree induction

Tree construction (top-down divide-and-conquer strategy)

    • At the beginning, all training samples belong to the root
    • Examples are partitioned recursively based on selected “most discriminative” attributes (partitioning data into most homogeneous subsets)
    • Discriminative power based on information gain (in ID3 and C4.5 algorithms) or Gini impurity (in CART algorithm)

Partitioning stops if

    • All samples belong to the same class → assign the class label to the leaf
    • There are no attributes left → majority voting to assign the class label to the leaf

37 of 67

Decision tree induction

Age?
  <=30 → ...
  31..40 → yes
  >40 → ...

38 of 67

Decision tree induction

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → ...
39 of 67

Decision tree induction

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → Credit rating?
    fair → yes
    excellent → no

40 of 67

Attribute selection

For a given branch in the tree, the set of samples S to be classified has P positive and N negative samples

The amount of entropy in the set S is

H(P, N) = – P/(P+N) ∙ log2(P/(P+N)) – N/(P+N) ∙ log2(N/(P+N))

Note that:

      • If P = 0 or N = 0: H(P, N) = 0 → no uncertainty
      • If P = N: H(P, N) = 1 → maximum uncertainty

[Plot: H(P, N) as a function of P/(P+N)]

41 of 67

Attribute selection

HS = H(9, 5) = 0.94

Age [<=30]: H(2, 3) = 0.97
Age [31...40]: H(4, 0) = 0
Age [>40]: H(3, 2) = 0.97
Student [yes]: H(6, 1) = 0.59
Student [no]: H(3, 4) = 0.98
Income [high]: H(2, 2) = 1
Income [med]: H(4, 2) = 0.92
Income [low]: H(3, 1) = 0.81
Rating [fair]: H(6, 2) = 0.81
Rating [exc]: H(3, 3) = 1

42 of 67

Attribute selection

HS = H(9, 5) = 0.94

HAge = p([<=30]) ∙ H(2, 3) + p([31...40]) ∙ H(4, 0) + p([>40]) ∙ H(3, 2) =

= 5/14 ∙ 0.97 + 4/14 ∙ 0 + 5/14 ∙ 0.97 = 0.69

HIncome = p([high]) ∙ H(2, 2) + p([med]) ∙ H(4, 2) + p([low]) ∙ H(3, 1) =

= 4/14 ∙ 1 + 6/14 ∙ 0.92 + 4/14 ∙ 0.81 = 0.91

HStudent = p([yes]) ∙ H(6, 1) + p([no]) ∙ H(3, 4) = 7/14 ∙ 0.59 + 7/14 ∙ 0.98 = 0.78

HRating = p([fair]) ∙ H(6, 2) + p([exc]) ∙ H(3, 3) = 8/14 ∙ 0.81 + 6/14 ∙ 1 = 0.89

43 of 67

Attribute selection

Attribute A partitions S into S1, S2, ..., Sv

Entropy of attribute A is

HA = Σi p(Si) ∙ H(Pi, Ni),  where p(Si) = |Si| / |S|

The information gain obtained by splitting S using A is

Gain(A) = HS – HA

Gain(Age) = 0.94 – 0.69 = 0.25

Gain(Income) = 0.94 – 0.91 = 0.03

Gain(Student) = 0.94 – 0.78 = 0.16

Gain(Rating) = 0.94 – 0.89 = 0.05

← split on age
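
A small sketch that reproduces these numbers from the (P, N) counts on the previous slides:

# Entropy and information gain for the running example (a sketch).
import math

def H(P, N):
    # Binary entropy of a set with P positive and N negative samples
    total = P + N
    h = 0.0
    for c in (P, N):
        if c:                      # 0 * log(0) is taken as 0
            p = c / total
            h -= p * math.log2(p)
    return h

H_S = H(9, 5)                                            # ≈ 0.94

def attribute_entropy(counts):
    # counts: list of (P_i, N_i) per attribute value; weighted average entropy
    total = sum(P + N for P, N in counts)
    return sum((P + N) / total * H(P, N) for P, N in counts)

H_age = attribute_entropy([(2, 3), (4, 0), (3, 2)])      # ≈ 0.69
print(round(H_S - H_age, 2))                             # Gain(Age) ≈ 0.25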

44 of 67

Pruning

The construction phase does not filter out noise → overfitting

Many possible pruning strategies

      • Stop partitioning a node when the corresponding number of samples assigned to a leaf goes below a threshold
      • Bottom-up cross validation: Build the full tree and replace nodes with leaves labeled with the majority class if classification accuracy on a validation set (not seen during training!) does not get worse this way
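
A hedged scikit-learn sketch of both ideas (illustrative synthetic data; min_samples_leaf implements the sample-count threshold, and cost-complexity pruning via ccp_alpha is a related bottom-up pruning strategy, tuned here on a validation set not seen during training):

# Pre- and post-pruning of a decision tree (a sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for alpha in [0.0, 0.001, 0.005, 0.01, 0.02]:
    # Stop splitting when a leaf would get fewer than 10 samples; prune with strength alpha.
    tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=alpha, random_state=0)
    tree.fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)      # accuracy on the validation set
    if best is None or acc > best[0]:
        best = (acc, alpha)
print(best)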

45 of 67

Comments

Decision trees are an example of a classification algorithm

      • Many others out there (k-NN, naive Bayes, SVM, neural networks, logistic regression, random forests, ...)

Maybe not the best one ...

      • Sensitive to small perturbations in the data (high variance)
      • Tend to overfit
      • Non-incremental: need to be re-trained from scratch if new training data becomes available

46 of 67

Decision tree models

  • As tree depth increases, how do bias and variance change? (Hint: think about k-NN)

POLLING TIME

47 of 67

Decision tree models

  • As tree depth increases, bias decreases and variance generally increases. Why? (Hint: think about k-NN)

Bias decreases with tree depth

Variance increases with tree depth

48 of 67

Ensemble methods

Are, metaphorically, like “democratic” machine learning algorithms:

  • Take a collection of simple or weak learners
  • Combine their results to make a single, better learner

Types:

  • Bagging: train learners in parallel on different samples of the data, then combine by voting (for discrete output) or by averaging (for continuous output).
  • Boosting: train learners sequentially, filtering/weighting samples based on the output of previous train/test runs.
  • Stacking: combine outputs from various models using a second-stage learner (e.g., linear regression).

49 of 67

Random forests

Grow K trees on datasets sampled from the original dataset (size N) with replacement (bootstrap samples), p = number of features.

  • Draw K bootstrap samples of size N
  • Grow each decision tree by selecting a random set of m out of p features at each node and choosing the best feature to split on.
  • At testing time, aggregate the predictions of the trees (most popular vote, or average) to produce the final class (example of bagging).

Typically m might be e.g. sqrt(p), but can be smaller.
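
A minimal scikit-learn sketch of this recipe (illustrative synthetic data; n_estimators plays the role of K and max_features the role of m):

# Random forest = bagging of decision trees (a sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,        # K trees
    max_features="sqrt",     # m = sqrt(p) features tried at each node
    bootstrap=True,          # each tree sees a bootstrap sample of size N
    n_jobs=-1,               # trees are independent, so training parallelizes easily
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # final class = majority vote over the K trees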

50 of 67

Random forests

Principles: we want to take a vote between different learners so we don’t want the models to be too similar. The following two criteria ensure diversity in the individual trees:

  • Draw K bootstrap samples of size N:
    • Each tree is trained on different data.
  • Grow a decision tree by selecting a random set of m out of p features at each node, and choosing the best feature to split on.
    • Corresponding nodes in different trees (usually) can’t use the same feature to split on.

51 of 67

Random forests

  • Very popular in practice, probably the most popular classifier for dense data (up to a few thousand features)
  • Easy to implement (simply train many normal decision trees)
  • Easy to parallelize

  • Needs many passes over the data – at least the max depth of the trees (<< boosted trees though, cf. next slide)

52 of 67

Boosted decision trees

  • A more recent alternative to random forests (RF) [good intro here]
  • In contrast to RF, whose trees are trained independently by bagging, BDT trees are trained sequentially by boosting: Each tree is trained to predict (“correct”) residual errors of previous trees (→ bias reduction).
  • Final prediction: sum of predictions made by individual trees.
  • Both RF and boosted trees can produce very high-quality models. Superiority of one method or the other is dataset-dependent.
  • Resource requirements are very different as well, so it’s actually non-trivial to compare the methods.
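
A minimal scikit-learn sketch (illustrative synthetic data and settings; gradient boosting fits each shallow tree to the residual errors of the current ensemble and sums the trees' contributions):

# Boosted decision trees via gradient boosting (a sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bdt = GradientBoostingClassifier(
    n_estimators=500,     # many trees...
    max_depth=4,          # ...each one shallow
    learning_rate=0.05,   # how strongly each tree corrects the previous ones
    random_state=0,
)
bdt.fit(X, y)
print(bdt.score(X, y))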

53 of 67

Random forests vs. boosted trees

  • The “geometry” of the methods is very different:
  • Random forests use 10s of deep, large trees:

    • Depth 10–30
    • 10’s of trees
    • 100k’s of nodes
    • Bias reduction through depth
    • Variance reduction through the ensemble aggregate

54 of 67

Random forests vs. boosted trees

  • The “geometry” of the methods is very different:
  • Boosted decision trees use 1000s of shallow, small trees:

    • Depth 4–8
    • 1000’s of trees
    • 100’s of nodes
    • Bias reduction through boosting – variance already low

55 of 67

Random forests vs. boosted trees

  • RF training embarrassingly parallel, can be very fast
  • Evaluation of trees (runtime) also much faster for RFs

  • Random forests: depth 10–30, 10’s of trees, 100k’s of nodes
  • Boosted trees: depth 4–8, 1000’s of trees, 100’s of nodes

56 of 67

For your personal perusal: “A visual introduction to machine learning”

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

57 of 67

Linear and logistic regression

58 of 67

Linear regression

  • Your good friend from lecture 5 on regression analysis
  • Goal: find the “best” line (linear function y = f(X)) to explain the data


59 of 67

How to model binary events?

  • E.g., X: student features; y: did student pass ADA?
  • Desired output: f(X) = probability of passing ADA, given feats X
  • Problem with linear regression: f(X) can be below 0 or above 1

60 of 67

Logistic regression

  • Trick: don’t deal with probabilities, which range from 0 to 1, but with log odds, which range from –∞ to +∞
  • Probability y ⇔ odds y/(1–y) ⇔ log odds log[y/(1–y)]
  • Model log odds as a linear function of X


61 of 67

Logistic regression

  • Model log odds as a linear function of X
  • βᵀX = log[y/(1–y)]
  • Solve for y: y = 1 / (1 + exp(–βᵀX)) (the “sigmoid” function)
  • Finding best model β via maximum likelihood:
    • Don’t use square loss as in linear regression (where y is assumed to be generated from a Normal distribution)
    • Use cross-entropy loss instead (y assumed to be generated from a Bernoulli distribution, i.e., a biased coin)
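
A small NumPy sketch of the model and loss (illustrative names and toy numbers; in practice β is found by minimizing the cross-entropy, e.g. with gradient descent or a library such as scikit-learn):

# Logistic regression model and cross-entropy loss (a sketch).
import numpy as np

def sigmoid(z):
    # Maps log odds (-inf, +inf) back to probabilities (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    # Model: log odds log[y/(1-y)] = beta^T X  =>  y = sigmoid(beta^T X)
    return sigmoid(X @ beta)

def cross_entropy(y_true, p):
    # Negative log-likelihood under a Bernoulli ("biased coin") model
    eps = 1e-12
    return -np.mean(y_true * np.log(p + eps) + (1 - y_true) * np.log(1 - p + eps))

# Toy example with a hypothetical weight vector beta
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # first column = intercept
beta = np.array([0.2, 1.5])
p = predict_proba(X, beta)
print(p, cross_entropy(np.array([1, 0, 1]), p))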

62 of 67

Overfitting

  • The more features the better?
    • No!
    • More features mean less bias, but more variance
    • Overfitting
  • Carefully selected features can improve model accuracy
    • E.g., keep features that correlate with the label y
    • Forward/backward feature selection
    • Regularization (e.g., penalize norm of weight vector)
  • More on such practical aspects: next lecture (“applied ML”)

63 of 67

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Pumpkin Pete?

64 of 67

Criteria

Predictive performance (accuracy, AUC/ROC, precision, recall, F1-score, etc.)

Speed and scalability

    • Time to build the model
    • Time to use the model
    • In memory vs. on disk processing
    • Communication cost

Robustness

    • Handling noise, outliers, missing values

Interpretability

    • Understanding the model and its decisions (black box vs. white box)

Compactness of the model

    • Mobile and embedded devices

65 of 67

k-NN and the curse of dimensionality

The curse of dimensionality refers to “weird” phenomena that occur in high dimensions (100s to millions) that do not occur in low-dimensional (e.g. 3-dimensional) space.

In particular data in high dimensions are much sparser (less dense) than data in low dimensions.

For k-NN, this means there are fewer points that are very close in feature space (very similar) to the point X whose y we want to predict.
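
A small simulation sketch of this sparsity effect (random points in a d-dimensional unit cube; as d grows, even the nearest neighbor of a query point is barely closer than an average point):

# Distance concentration in high dimensions (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.random((n, d))                    # n uniform points in the d-dim unit cube
    q = rng.random(d)                         # a query point
    dists = np.linalg.norm(X - q, axis=1)     # Euclidean distances to all points
    print(d, round(dists.min() / dists.mean(), 3))   # ratio approaches 1 as d grows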

66 of 67

k-NN and the curse of dimensionality

From this perspective, it’s surprising that k-NN works at all in high dimensions.

Luckily, real data are not like random points in a high-dimensional cube. Instead, they live in dense clusters and near much lower-dimensional surfaces.

Also, points can be very “similar” even if their Euclidean distance is large. E.g., documents with the same few dominant words are likely to be on the same topic (→ use a different distance).

67 of 67

k-NN and the curse of dimensionality