1 of 67


Applied Data Analysis (CS401)

Lecture 7

Learning from data:

Supervised learning

1 Nov 2023

Robert West

2 of 67

Annowwwwwncements

  • Happy (belated) Halloween!
  • Homework H1 is being graded. Feedback to be released next week.
  • Project milestone P2 due on Fri 17 Nov
  • Friday’s lab session: two parallel tracks:
    • Quiz 6
    • Track 1: exercise on supervised learning (in BCH 2201)
    • Track 2: project office hours (on Zoom)
      • Logistics: see Ed post
      • Do come and ask for feedback – everyone will win!

3 of 67

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Pumpkin Pete?

4 of 67

Why ML as part of data science?


  • ML can facilitate most steps of the data analysis cycle.
  • But stay critical: “Can I trust my ML model?”

5 of 67

Machine learning

  • Supervised: We are given input/output pairs (X, y) (a.k.a. “samples”) that are related via a function y = f(X). We would like to “learn” f, and evaluate it on new data. Types:
    • Discrete y (class labels): “classification”
    • Continuous y: “regression” (e.g., linear regression)
  • Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is a “simpler” representation.
    • Discrete y (cluster labels): “clustering”
    • Continuous y: “dimensionality reduction”

6 of 67

Machine learning: examples

  • Supervised (lecture 7, i.e., today):
    • Is this image a cat, dog, or cow?
    • How would this user rate that restaurant?
    • Is this email spam?
    • Is this blob on a telescope image a supernova?
  • Unsupervised (lecture 9):
    • Cluster handwritten digit data into 10 classes
    • What are the top 20 topics in Twitter right now?
    • Find the best 2D visualization of 1000-dimensional data

7 of 67

Machine learning: techniques

  • Supervised learning:
    • k-NN (k nearest neighbors)
    • Tree-based models: decision trees, random forests
    • Linear + logistic regression
    • Naïve Bayes
    • Support vector machines
    • Supervised neural networks
    • etc.
  • Unsupervised learning:
    • Clustering
    • Dimensionality reduction: topic modeling, matrix factorization (PCA, SVD, word2vec)
    • Hidden Markov models (HMM)
    • etc.

Today: supervised learning (particularly in light of the bias/variance tradeoff)

Lectures 9, 10, 11: unsupervised learning

8 of 67

Intro to supervised learning:

k nearest neighbors (k-NN)

9 of 67

k nearest neighbors (k-NN)

Given a query item: find the k closest matches in a labeled dataset ↓

10 of 67

k nearest neighbors (k-NN)

Given a query item: find the k closest matches in a labeled dataset ↓, then return the most frequent label among the k

11 of 67

k nearest neighbors (k-NN)

k = 3 votes for “cat”

12 of 67

Properties of k-NN

The data is the model

  • No training needed.
  • Conceptually simple algorithm.
  • Accuracy generally improves with more data.
  • Usually need data in memory, but can also be run from disk.

Minimal configuration:

  • Only one parameter: k (number of neighbors)
  • But two other choices are also important:
    • Similarity metric
    • Weighting of neighbors in voting (e.g. by similarity)

13 of 67

k-NN flavors

Classification:

  • Model is y = f(X), y is from a discrete set (labels).
  • Given X, compute y = majority vote of the k nearest neighbors.
  • Can also use a weighted vote* of the neighbors.

Regression:

  • Model is y = f(X), y is a real value.
  • Given X, compute y = average value of the k nearest neighbors.
  • Can also use a weighted average* of the neighbors.

* Weighting function is usually the similarity.
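
For reference, a minimal scikit-learn sketch of both flavors (illustrative variable names and toy data; weights="distance" gives the similarity-weighted vote/average mentioned above):

# Minimal sketch of k-NN classification and regression with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                          # 100 samples, 2 features
y_class = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)    # discrete labels
y_reg = 2 * X_train[:, 0] + X_train[:, 1]                    # continuous target
X_query = rng.normal(size=(5, 2))

# Classification: (distance-weighted) majority vote of the k nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance")
clf.fit(X_train, y_class)
print(clf.predict(X_query))

# Regression: (distance-weighted) average of the k nearest neighbors' values.
reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
reg.fit(X_train, y_reg)
print(reg.predict(X_query))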

14 of 67

k-NN distance (opposite of similarity) measures
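
For reference, a minimal sketch of three distance measures commonly used with k-NN (Euclidean, Manhattan, and cosine distance; a and b are NumPy feature vectors):

# Common k-NN distance measures (a sketch).
import numpy as np

def euclidean(a, b):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; small when the vectors point in similar directions
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))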

 

15 of 67

k-NN distance (opposite of similarity) measures

16 of 67

stack.push(kNN)


17 of 67

Predicting from samples

  • Most datasets are samples from a (maybe infinite) population.
  • We are most interested in models of the population, but we only have access to a sample (blue points) from the population.

For a dataset consisting of pairs (X, y):

  • features X, label y,

we aim to find the true model f:

  • y = f(X).

We train on a training sample D and denote the fitted model as fD.

18 of 67

Bias and variance

  • Given a random training sample D, obtain model fD
  • For a new data point (X, y), prediction is fD(X)
  • (Squared) error = E[(fD(X) – y)²] (E is expectation over D!)
  • Fact: error can be decomposed into two parts (derivation)
    • Error² = Bias² + Variance
    • Bias = E[fD(X) – y]
    • Variance = E[(fD(X) – E[fD(X)])²]
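
For reference, a short derivation (for a fixed test point (X, y), with the expectation taken over the training sample D; adding and subtracting the mean prediction makes the cross term vanish):

\begin{aligned}
\mathrm{Error}^2 &= \mathbb{E}_D\big[(f_D(X) - y)^2\big] \\
&= \mathbb{E}_D\big[\big(f_D(X) - \mathbb{E}_D[f_D(X)] + \mathbb{E}_D[f_D(X)] - y\big)^2\big] \\
&= \underbrace{\mathbb{E}_D\big[\big(f_D(X) - \mathbb{E}_D[f_D(X)]\big)^2\big]}_{\text{Variance}}
 + \underbrace{\big(\mathbb{E}_D[f_D(X)] - y\big)^2}_{\text{Bias}^2}
\end{aligned}
% The cross term vanishes because \mathbb{E}_D\big[f_D(X) - \mathbb{E}_D[f_D(X)]\big] = 0.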

19 of 67

Bias and variance

 

20 of 67

Consider a fixed testing point (X, y), not seen during training.

[Illustration: true value y = f(X) vs. predictions fD1(X), fD2(X) from models trained on different samples]

“Full” bias/variance: average this picture over all testing points (X, y)

21 of 67

Bias/variance tradeoff

Since Error² = Bias² + Variance, there is a tradeoff, usually modulated via model complexity:

Complex models (many parameters) usually have lower bias, but higher variance.

Simple models (few parameters) have higher bias, but lower variance.

22 of 67

Bias/variance tradeoff

Example: A linear model can only fit a straight line. A high-degree polynomial can fit a complex curve. But the polynomial will fit the individual training sample, rather than the full population. Its shape can vary from sample to sample, so it has high variance.


23 of 67

Bias/variance tradeoff

 

24 of 67

kNN = stack.pop()


25 of 67

Choosing k for k nearest neighbors

We have a bias/variance tradeoff:

  • Small k → ?
  • Large k → ?

THINK FOR A MINUTE:

When k increases,�how do bias and variance change?

(Feel free to discuss with your neighbor.)

26 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance
  • Assume the real data follows the blue curve, with zero-mean additive noise. Red points are a data sample.

[Plot: true curve y = f(X) (blue) with sampled points (red)]

27 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 1 fit to points from the original sample; bias = average offset from the true curve y = f(X)]

28 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 1 fit to points from a different sample; bias = average offset from the true curve y = f(X)]

29 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 8 fit; bias = average offset from the true curve y = f(X)]

30 of 67

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

[Plots: k = 8 fit; bias = average offset from the true curve y = f(X)]

31 of 67

Choosing k in practice

Use leave-one-out (LOO) cross-validation:

  • Split: Break data into train and test subsets, e.g. an 80/20% random split.
  • Predict: For each point in the training set, predict using the k nearest neighbors from the set of all other points in the training set. Measure the LOO error rate (classification) or squared error (regression).
  • Tune: Try different values of k, and use the one that gives minimum leave-one-out error.
  • Evaluate: Measure error on the test set to quantify performance.
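
A minimal scikit-learn sketch of this recipe (illustrative synthetic data; cross_val_score with LeaveOneOut computes the LOO error on the training set):

# Choosing k for k-NN via leave-one-out cross-validation (a sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Split: e.g., 80/20% random split into train and test.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Predict + Tune: LOO misclassification rate on the training set for several k.
loo_error = {}
for k in [1, 3, 5, 7, 9, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=LeaveOneOut())
    loo_error[k] = 1 - scores.mean()
best_k = min(loo_error, key=loo_error.get)

# Evaluate: error on the held-out test set with the chosen k.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(best_k, 1 - final.score(X_te, y_te))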

32 of 67

Commercial break


Trick or data!

Data is our candy.

33 of 67

Decision trees

34 of 67

Decision trees: example

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → Credit rating?
    fair → yes
    excellent → no

35 of 67

Decision trees

Model: Flow-chart-like tree structure

    • Nodes are tests on a single attribute
    • Branches are attribute values of parent node
    • Leaves are marked with class labels

Goal: Find decision tree that maximizes classification accuracy on given dataset

Optimization:

  • NP-hard
  • Heuristic: greedy top-down tree construction + pruning

36 of 67

Decision tree induction

Tree construction (top-down divide-and-conquer strategy)

    • At the beginning, all training samples belong to the root
    • Examples are partitioned recursively based on selected “most discriminative” attributes (partitioning data into most homogeneous subsets)
    • Discriminative power based on information gain (in ID3 and C4.5 algorithms) or Gini impurity (in CART algorithm)

Partitioning stops if

    • All samples belong to the same class → assign the class label to the leaf
    • There are no attributes left → majority voting to assign the class label to the leaf

37 of 67

Decision tree induction

Age?
  <=30 → ...
  31..40 → yes
  >40 → ...

38 of 67

Decision tree induction

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → ...
39 of 67

Decision tree induction

Age?
  <=30 → Student?
    yes → yes
    no → no
  31..40 → yes
  >40 → Credit rating?
    fair → yes
    excellent → no

40 of 67

Attribute selection

For a given branch in the tree, the set of samples S to be classified has P positive and N negative samples

The amount of entropy in the set S is

H(P, N) = – P/(P+N) ∙ log2(P/(P+N)) – N/(P+N) ∙ log2(N/(P+N))

Note that:

      • If P = 0 or N = 0: H(P, N) = 0 → no uncertainty
      • If P = N: H(P, N) = 1 → maximum uncertainty

[Plot: H(P, N) as a function of P/(P+N)]

41 of 67

Attribute selection

HS = H(9, 5) = 0.94

Age [<=30]: H(2, 3) = 0.97
Age [31...40]: H(4, 0) = 0
Age [>40]: H(3, 2) = 0.97
Student [yes]: H(6, 1) = 0.59
Student [no]: H(3, 4) = 0.98
Income [high]: H(2, 2) = 1
Income [med]: H(4, 2) = 0.92
Income [low]: H(3, 1) = 0.81
Rating [fair]: H(6, 2) = 0.81
Rating [exc]: H(3, 3) = 1

42 of 67

Attribute selection

HS = H(9, 5) = 0.94

HAge = p([<=30]) ∙ H(2, 3) + p([31...40]) ∙ H(4, 0) + p([>40]) ∙ H(3, 2) =

= 5/14 ∙ 0.97 + 4/14 ∙ 0 + 5/14 ∙ 0.97 = 0.69

HIncome = p([high]) ∙ H(2, 2) + p([med]) ∙ H(4, 2) + p([low]) ∙ H(3, 1) =

= 4/14 ∙ 1 + 6/14 ∙ 0.92 + 4/14 ∙ 0.81 = 0.91

HStudent = p([yes]) ∙ H(6, 1) + p([no]) ∙ H(3, 4) = 7/14 ∙ 0.59 + 7/14 ∙ 0.98 = 0.78

HRating = p([fair]) ∙ H(6, 2) + p([exc]) ∙ H(3, 3) = 8/14 ∙ 0.81 + 6/14 ∙ 1 = 0.89

43 of 67

Attribute selection

Attribute A partitions S into S1, S2, ..., Sv

Entropy of attribute A is

HA = Σi p(Si) ∙ H(Pi, Ni),  where p(Si) = |Si| / |S|

The information gain obtained by splitting S using A is

Gain(A) = HS – HA

Gain(Age) = 0.94 – 0.69 = 0.25

Gain(Income) = 0.94 – 0.91 = 0.03

Gain(Student) = 0.94 – 0.78 = 0.16

Gain(Rating) = 0.94 – 0.89 = 0.05

← split on age
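
A small sketch that reproduces these numbers from the (P, N) counts on the previous slides:

# Entropy and information gain for the running example (a sketch).
import math

def H(P, N):
    # Binary entropy of a set with P positive and N negative samples
    total = P + N
    h = 0.0
    for c in (P, N):
        if c:                      # 0 * log(0) is taken as 0
            p = c / total
            h -= p * math.log2(p)
    return h

H_S = H(9, 5)                                            # ≈ 0.94

def attribute_entropy(counts):
    # counts: list of (P_i, N_i) per attribute value; weighted average entropy
    total = sum(P + N for P, N in counts)
    return sum((P + N) / total * H(P, N) for P, N in counts)

H_age = attribute_entropy([(2, 3), (4, 0), (3, 2)])      # ≈ 0.69
print(round(H_S - H_age, 2))                             # Gain(Age) ≈ 0.25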

44 of 67

Pruning

The construction phase does not filter out noise → overfitting

Many possible pruning strategies

      • Stop partitioning a node when the corresponding number of samples assigned to a leaf goes below a threshold
      • Bottom-up cross validation: Build the full tree and replace nodes with leaves labeled with the majority class if classification accuracy on a validation set (not seen during training!) does not get worse this way
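
A hedged scikit-learn sketch of both ideas (illustrative synthetic data; min_samples_leaf implements the sample-count threshold, and cost-complexity pruning via ccp_alpha is a related bottom-up pruning strategy, tuned here on a validation set not seen during training):

# Pre- and post-pruning of a decision tree (a sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for alpha in [0.0, 0.001, 0.005, 0.01, 0.02]:
    # Stop splitting when a leaf would get fewer than 10 samples; prune with strength alpha.
    tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=alpha, random_state=0)
    tree.fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)      # accuracy on the validation set
    if best is None or acc > best[0]:
        best = (acc, alpha)
print(best)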

45 of 67

Comments

Decision trees are an example of a classification algorithm

      • Many others out there (k-NN, naive Bayes, SVM, neural networks, logistic regression, random forests, ...)

Maybe not the best one ...

      • Sensitive to small perturbations in the data (high variance)
      • Tend to overfit
      • Non-incremental: need to be re-trained from scratch if new training data becomes available

46 of 67

Decision tree models

  • As tree depth increases, how do bias and variance change? (Hint: think about k-NN)

POLLING TIME

47 of 67

Decision tree models

  • As tree depth increases, bias decreases and variance generally increases. Why? (Hint: think about k-NN)

Bias decreases with tree depth

Variance increases with tree depth

48 of 67

Ensemble methods

Are, metaphorically, like “democratic” machine learning algorithms:

  • Take a collection of simple or weak learners
  • Combine their results to make a single, better learner

Types:

  • Bagging: train learners in parallel on different samples of the data, then combine by voting (for discrete output) or by averaging (for continuous output).
  • Boosting: train learners sequentially, filtering/weighting samples based on the output of previous train/test runs.
  • Stacking: combine outputs from various models using a second-stage learner (e.g., linear regression).

49 of 67

Random forests

Grow K trees on datasets sampled from the original dataset (size N) with replacement (bootstrap samples), p = number of features.

  • Draw K bootstrap samples of size N
  • Grow each decision tree by selecting a random set of m out of p features at each node and choosing the best feature to split on.
  • At testing time, aggregate the predictions of the trees (most popular vote, or average) to produce the final class (example of bagging).

Typically m might be e.g. sqrt(p), but can be smaller.
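
A minimal scikit-learn sketch of this recipe (illustrative synthetic data; n_estimators plays the role of K and max_features the role of m):

# Random forest = bagging of decision trees (a sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,        # K trees
    max_features="sqrt",     # m = sqrt(p) features tried at each node
    bootstrap=True,          # each tree sees a bootstrap sample of size N
    n_jobs=-1,               # trees are independent, so training parallelizes easily
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # final class = majority vote over the K trees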

50 of 67

Random forests

Principles: we want to take a vote between different learners so we don’t want the models to be too similar. The following two criteria ensure diversity in the individual trees:

  • Draw K bootstrap samples of size N:
    • Each tree is trained on different data.
  • Grow a decision tree by selecting a random set of m out of p features at each node, and choosing the best feature to split on.
    • Corresponding nodes in different trees (usually) can’t use the same feature to split on.

51 of 67

Random forests

  • Very popular in practice, probably the most popular classifier for dense data (up to a few thousand features)
  • Easy to implement (simply train many normal decision trees)
  • Easy to parallelize

  • Needs many passes over the data – at least the max depth of the trees (<< boosted trees though, cf. next slide)

52 of 67

Boosted decision trees

  • A more recent alternative to random forests (RF) [good intro here]
  • In contrast to RF, whose trees are trained independently by bagging, BDT trees are trained sequentially by boosting: Each tree is trained to predict (“correct”) residual errors of previous trees (→ bias reduction).
  • Final prediction: sum of predictions made by individual trees.
  • Both RF and boosted trees can produce very high-quality models. Superiority of one method or the other is dataset-dependent.
  • Resource requirements are very different as well, so it’s actually non-trivial to compare the methods.
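
A minimal scikit-learn sketch (illustrative synthetic data and settings; gradient boosting fits each shallow tree to the residual errors of the current ensemble and sums the trees' contributions):

# Boosted decision trees via gradient boosting (a sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bdt = GradientBoostingClassifier(
    n_estimators=500,     # many trees...
    max_depth=4,          # ...each one shallow
    learning_rate=0.05,   # how strongly each tree corrects the previous ones
    random_state=0,
)
bdt.fit(X, y)
print(bdt.score(X, y))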

53 of 67

Random forests vs. boosted trees

  • The “geometry” of the methods is very different:
  • Random forests use 10s of deep, large trees:

    • Depth 10–30
    • 10’s of trees
    • 100k’s of nodes
    • Bias reduction through depth
    • Variance reduction through the ensemble aggregate

54 of 67

Random forests vs. boosted trees

  • The “geometry” of the methods is very different:
  • Boosted decision trees use 1000s of shallow, small trees:

    • Depth 4–8
    • 1000’s of trees
    • 100’s of nodes
    • Bias reduction through boosting – variance already low

55 of 67

Random forests vs. boosted trees

  • RF training embarrassingly parallel, can be very fast
  • Evaluation of trees (runtime) also much faster for RFs

  • Random forests: depth 10–30, 10’s of trees, 100k’s of nodes
  • Boosted trees: depth 4–8, 1000’s of trees, 100’s of nodes

56 of 67

For your personal perusal: “A visual introduction to machine learning”

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

57 of 67

Linear and logistic regression

58 of 67

Linear regression

  • Your good friend from lecture 5 on regression analysis
  • Goal: find the “best” line (linear function y = f(X)) to explain the data


59 of 67

How to model binary events?

  • E.g., X: student features; y: did student pass ADA?
  • Desired output: f(X) = probability of passing ADA, given feats X
  • Problem with linear regression: f(X) can be below 0 or above 1

60 of 67

Logistic regression

  • Trick: don’t deal with probabilities, which range from 0 to 1, but with log odds, which range from –∞ to +∞
  • Probability y ⇔ odds y/(1–y) ⇔ log odds log[y/(1–y)]
  • Model log odds as a linear function of X


61 of 67

Logistic regression

  • Model log odds as a linear function of X
  • βᵀX = log[y/(1–y)]
  • Solve for y: y = 1 / (1 + exp(–βᵀX)) (the “sigmoid” function)
  • Finding best model β via maximum likelihood:
    • Don’t use square loss as in linear regression (where y is assumed to be generated from a Normal distribution)
    • Use cross-entropy loss instead (y assumed to be generated from a Bernoulli distribution, i.e., a biased coin)
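
A small NumPy sketch of the model and loss (illustrative names and toy numbers; in practice β is found by minimizing the cross-entropy, e.g. with gradient descent or a library such as scikit-learn):

# Logistic regression model and cross-entropy loss (a sketch).
import numpy as np

def sigmoid(z):
    # Maps log odds (-inf, +inf) back to probabilities (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    # Model: log odds log[y/(1-y)] = beta^T X  =>  y = sigmoid(beta^T X)
    return sigmoid(X @ beta)

def cross_entropy(y_true, p):
    # Negative log-likelihood under a Bernoulli ("biased coin") model
    eps = 1e-12
    return -np.mean(y_true * np.log(p + eps) + (1 - y_true) * np.log(1 - p + eps))

# Toy example with a hypothetical weight vector beta
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # first column = intercept
beta = np.array([0.2, 1.5])
p = predict_proba(X, beta)
print(p, cross_entropy(np.array([1, 0, 1]), p))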

62 of 67

Overfitting

  • The more features the better?
    • No!
    • More features mean less bias, but more variance
    • Overfitting
  • Carefully selected features can improve model accuracy
    • E.g., keep features that correlate with the label y
    • Forward/backward feature selection
    • Regularization (e.g., penalize norm of weight vector)
  • More on such practical aspects: next lecture (“applied ML”)

63 of 67

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec7-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Pumpkin Pete?

64 of 67

Criteria

Predictive performance (accuracy, AUC/ROC, precision, recall, F1-score, etc.)

Speed and scalability

    • Time to build the model
    • Time to use the model
    • In memory vs. on disk processing
    • Communication cost

Robustness

    • Handling noise, outliers, missing values

Interpretability

    • Understanding the model and its decisions (black box vs. white box)

Compactness of the model

    • Mobile and embedded devices

65 of 67

k-NN and the curse of dimensionality

The curse of dimensionality refers to “weird” phenomena that occur in high dimensions (100s to millions) that do not occur in low-dimensional (e.g. 3-dimensional) space.

In particular data in high dimensions are much sparser (less dense) than data in low dimensions.

For k-NN, this means there are fewer points that are very close in feature space (very similar) to the point X whose y we want to predict.
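
A small simulation sketch of this sparsity effect (random points in a d-dimensional unit cube; as d grows, even the nearest neighbor of a query point is barely closer than an average point):

# Distance concentration in high dimensions (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.random((n, d))                    # n uniform points in the d-dim unit cube
    q = rng.random(d)                         # a query point
    dists = np.linalg.norm(X - q, axis=1)     # Euclidean distances to all points
    print(d, round(dists.min() / dists.mean(), 3))   # ratio approaches 1 as d grows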

66 of 67

k-NN and the curse of dimensionality

From this perspective, it’s surprising that k-NN works at all in high dimensions.

Luckily, real data are not like random points in a high-dimensional cube. Instead, they live in dense clusters and near much lower-dimensional surfaces.

Also, points can be very “similar” even if their Euclidean distance is large. E.g., documents with the same few dominant words are likely to be on the same topic (→ use a different distance).

67 of 67

k-NN and the curse of dimensionality