1 of 76

Applied Data Analysis (CS401)

Robert West

Lecture 8

Supervised learning

2018/11/08

2 of 76

3 of 76

Announcements

  • HW2 grades released
  • HW3 peer reviews
    • due in one week (Thu, Nov 15, 23:59)
    • distributed one day later (Fri, Nov 16)
  • HW4 released today (due Wed, Nov 21)
  • Tomorrow’s lab session:
    • HW2 post mortem
    • HW4 prae vitam
    • Tutorial: Applied ML
    • Office hours (HW + projects)

4 of 76

Feedback

Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

5 of 76

Machine Learning

  • Supervised: We are given input/output samples (X, y) which we relate with a function y = f(X). We would like to “learn” f, and evaluate it on new data. Types:
    • Classification: y is discrete (class labels)
    • Regression: y is continuous (e.g., linear regression)
  • Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is a “simpler” representation.
    • Clustering: y is discrete
    • y is continuous: matrix factorization, topic modeling, unsupervised neural networks, word2vec, etc.

6 of 76

Machine Learning: Examples

  • Supervised:
    • Is this image a cat, dog, car, house?
    • How would this user score that restaurant?
    • Is this email spam?
    • Is this blob a supernova?
  • Unsupervised:
    • Cluster some handwritten digit data into 10 classes
    • What are the top 20 topics in Twitter right now?
    • Find the best 2D visualization of 1000-dimensional data

7 of 76

Machine Learning: Techniques

  • Supervised Learning:
    • kNN (k nearest neighbors)
    • Naïve Bayes
    • Linear + logistic regression
    • Support vector machines
    • Random forests
    • Supervised neural networks
    • etc.
  • Unsupervised Learning:
    • Clustering
    • Dimensionality reduction; e.g., topic modeling, PCA, MDS
    • Hidden Markov models (HMM)
    • etc.

8 of 76

Criteria

Predictive performance (accuracy, AUC/ROC, precision, recall, F1-score, etc.)

Speed and scalability

    • Time to build the model
    • Time to use the model
    • In memory vs. on disk processing
    • Communication cost

Robustness

    • Handling noise, outliers, missing values

Interpretability

    • Understanding the model and its decisions (black box vs. white box)

Compactness of the model

    • Mobile and embedded devices

9 of 76

Intro to supervised learning:

k nearest neighbors (kNN)

10 of 76

k Nearest Neighbors

Given a query item: find the k closest matches in a labeled dataset.

11 of 76

k Nearest Neighbors

Given a query item: find the k closest matches and return the most frequent label.

12 of 76

k-Nearest Neighbors

k = 3 votes for “cat”

13 of 76

k-NN issues

The data is the model

  • No training needed.
  • Accuracy generally improves with more data.
  • Matching is simple and fairly fast if data fits in memory.
  • Usually need data in memory, but can be run off disk.

Minimal configuration:

  • Only parameter is k (number of neighbors)
  • But two other choices are important:
    • Weighting of neighbors (e.g. inverse distance)
    • Similarity metric

14 of 76

k-NN Flavors

Classification:

  • Model is y = f(X), y is from a discrete set (labels).
  • Given X, compute y = majority vote of the k nearest neighbors.
  • Can also use a weighted vote* of the neighbors.

Regression:

  • Model is y = f(X), y is a real value.
  • Given X, compute y = average value of the k nearest neighbors.
  • Can also use a weighted average* of the neighbors.

* Weight function is usually the inverse distance.
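Both flavors in a minimal sketch, assuming scikit-learn (the toy data, labels, and parameter values below are made up for illustration):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
    y_class = np.array(["cat", "cat", "dog", "dog"])   # discrete labels
    y_reg = np.array([1.2, 1.1, 7.9, 8.4])             # real values

    # Classification: (inverse-distance-weighted) majority vote of the k nearest neighbors
    clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y_class)
    print(clf.predict([[1.2, 1.9]]))                   # -> ['cat']

    # Regression: (inverse-distance-weighted) average of the k nearest neighbors
    reg = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y_reg)
    print(reg.predict([[1.2, 1.9]]))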

15 of 76

k-NN distance measures
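
The figure on this slide is not reproduced here; as an illustration, a few commonly used distance/similarity measures, sketched in plain numpy (my own example, not the lecture's code):

    import numpy as np

    def euclidean(a, b):           # L2 distance
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):           # L1 distance
        return np.sum(np.abs(a - b))

    def cosine_similarity(a, b):   # angle-based; larger = more similar
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a, b = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
    print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))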

 

16 of 76

k-NN metrics

17 of 76

stack.push(kNN)

18 of 76

Predicting from Samples

  • Most datasets are samples from an infinite population.
  • We are most interested in models of the population, but we have access only to a sample of it.

For datasets consisting of (X,y)

  • features X, label y

For the true model f:

  • y = f(X)

We train on a training sample D and denote the learned model as fD(X).

19 of 76

Bias and Variance

 

20 of 76

21 of 76

Bias and Variance Tradeoff

There is usually a bias-variance tradeoff caused by model complexity.

Complex models (many parameters) usually have lower bias but higher variance.

Simple models (few parameters) have higher bias but lower variance.

22 of 76

Bias and Variance Tradeoff

E.g., a linear model can only fit a straight line, while a high-degree polynomial can fit a complex curve. But the polynomial may end up fitting the individual sample rather than the population: its shape varies from sample to sample, so it has high variance.
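A small numerical illustration of this point (my own toy experiment, not from the slides): fit models of different complexity to many fresh samples and look at how much the prediction at a fixed point moves around.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)            # "true" population model

    def fit_on_fresh_sample(degree, n=15):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, 0.2, n)           # mean-zero additive noise
        return np.polyfit(x, y, degree)            # polynomial fit of given degree

    x0 = 0.25                                      # fixed test point
    for degree in (1, 9):                          # simple vs. complex model
        preds = [np.polyval(fit_on_fresh_sample(degree), x0) for _ in range(200)]
        # bias ~ average offset from f(x0); variance ~ spread across samples
        print(degree, np.mean(preds) - f(x0), np.var(preds))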

23 of 76

Bias and Variance Tradeoff

 

24 of 76

kNN = stack.pop()

25 of 76

Choosing k for k nearest neighbors

We have a bias/variance tradeoff:

  • Small k → ?
  • Large k → ?

26 of 76

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance
  • Assume the real data follows the blue curve, with some mean-zero additive noise. Red points are a data sample.

(Figure: red sample points scattered around the true blue curve y = f(X).)

27 of 76

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

(Figure: k=1 fit to points from the original sample; bias = average offset between the fit and the true curve y = f(X).)

28 of 76

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

(Figure: k=1 fit to points from a different sample; bias = average offset between the fit and the true curve y = f(X).)

29 of 76

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

(Figure: k=8 fit; bias = average offset between the fit and the true curve y = f(X).)

30 of 76

Choosing k

  • Small k → low bias, high variance
  • Large k → high bias, low variance

(Figure: k=8 fit; bias = average offset between the fit and the true curve y = f(X).)

31 of 76

Choosing k in practice

Split: break the data into training and test subsets, e.g., an 80/20 random split.

Predict: for each point in the training set, predict its label using the k nearest neighbors among all other training points (leave-one-out cross-validation, LOO). Measure the error rate (classification) or squared error (regression).

Tune: try different values of k and keep the one with the minimum leave-one-out error (see the sketch below).

Evaluate: test on the held-out test set to measure performance.
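A sketch of this procedure with scikit-learn (illustrative only; the data are assumed to be given as a feature matrix X and labels y):

    from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Tune: leave-one-out accuracy on the training set for several values of k
    loo_acc = {}
    for k in (1, 3, 5, 9, 15):
        clf = KNeighborsClassifier(n_neighbors=k)
        loo_acc[k] = cross_val_score(clf, X_train, y_train, cv=LeaveOneOut()).mean()
    best_k = max(loo_acc, key=loo_acc.get)

    # Evaluate: performance of the chosen k on the held-out test set
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print(best_k, final.score(X_test, y_test))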

32 of 76

kNN and the curse of dimensionality

The curse of dimensionality refers to phenomena that occur in high dimensions (100s to millions) that do not occur in low-dimensional (e.g. 3-dimensional) space.

In particular, data in high dimensions are much sparser (less dense) than data in low dimensions.

For kNN, that means there are fewer points in feature space that are very close (very similar) to the point we want to predict.
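A quick way to see the effect (my own toy experiment): for uniformly random points, the nearest neighbor of a query is barely closer than an average point once the dimension gets large.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(1000, d))   # 1000 random points in the unit cube
        q = rng.uniform(size=d)           # a query point
        dist = np.linalg.norm(X - q, axis=1)
        # ratio of nearest-neighbor distance to average distance approaches 1 as d grows
        print(d, dist.min() / dist.mean())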

33 of 76

kNN and the curse of dimensionality

 

34 of 76

kNN and the curse of dimensionality

From this perspective, it’s surprising that kNN works at all in high dimensions.

Luckily real data are not like random points in a high-dimensional cube. Instead they live in dense clusters and near much lower-dimensional surfaces.

Also, points can be very “similar” even if their Euclidean distance is large. E.g., documents with the same few dominant words are likely to be on the same topic (→ use a different distance measure).

35 of 76

Decision Trees

36 of 76

Decision trees: Example

Age?
  <=30   → Student?
             yes → yes
             no  → no
  31..40 → yes
  >40    → Credit rating?
             fair      → yes
             excellent → no

37 of 76

Decision trees

Model: flow-chart-like tree structure

    • Nodes are tests on a single attribute
    • Branches are attribute values
    • Leaves are marked with class labels

Score function: classification accuracy

Optimization:

  • NP-hard
  • Heuristic: greedy top-down tree construction + pruning

38 of 76

Decision tree induction

Tree construction (top-down divide-and-conquer strategy)

    • At the beginning, all training samples belong to the root
    • Examples are partitioned recursively based on selected “most discriminative” attributes
    • Discriminative power based on information gain (ID3/C4.5) or Gini impurity (CART)

Partitioning stops if

    • All samples belong to the same class → assign the class label to the leaf
    • There are no attributes left → majority voting to assign the class label to the leaf
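In practice this greedy induction is what tree libraries implement; a minimal sketch with scikit-learn (illustrative; data assumed given as X, y; “entropy” corresponds to the information-gain criterion, “gini” to CART):

    from sklearn.tree import DecisionTreeClassifier, export_text

    tree = DecisionTreeClassifier(criterion="entropy",  # or "gini"
                                  max_depth=3)          # simple pre-pruning
    tree.fit(X, y)                                      # X, y assumed given
    print(export_text(tree))                            # print the induced tree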

39 of 76

Decision tree induction

Age?
  <=30   → ?
  31..40 → yes
  >40    → ?

40 of 76

Decision tree induction

Age?
  <=30   → Student?
             yes → yes
             no  → no
  31..40 → yes
  >40    → ?

41 of 76

Decision tree induction

Age?
  <=30   → Student?
             yes → yes
             no  → no
  31..40 → yes
  >40    → Credit rating?
             fair      → yes
             excellent → no

42 of 76

Attribute selection

At a given branch in the tree, the set of samples S to be classified has P positive and N negative samples

The amount of entropy in the set S is

H(P, N) = − P/(P+N) ∙ log2(P/(P+N)) − N/(P+N) ∙ log2(N/(P+N))

Note that:

      • If P=0 (or N=0), H(P, N) = 0 → no uncertainty
      • If P=N, H(P, N) = 1 → max uncertainty

43 of 76

Attribute selection

HS = H(9, 5) = 0.94

Age [<=30] H(2, 3) = 0.97 Income [high] H(2, 2) = 1

Age [31...40] H(4, 0) = 0 Income [med] H(4, 2) = 0.92

Age [>40] H(3, 2) = 0.97 Income [low] H(3, 1) = 0.81

Student [yes] H(6, 1) = 0.59 Rating [fair] H(6, 2) = 0.81

Student [no] H(3, 4) = 0.98 Rating [exc] H(3, 3) = 1

44 of 76

Attribute selection

HS = H(9, 5) = 0.94

HAge = p([<=30]) ∙ H(2, 3) + p([31...40]) ∙ H(4, 0) + p([>40]) ∙ H(3, 2) =

= 5/14 ∙ 0.97 + 4/14 ∙ 0 + 5/14 ∙ 0.97 = 0.69

HIncome = p([high]) ∙ H(2, 2) + p([med]) ∙ H(4, 2) + p([low]) ∙ H(3, 1) =

= 4/14 ∙ 1 + 6/14 ∙ 0.92 + 4/14 ∙ 0.81 = 0.91

HStudent = p([yes]) ∙ H(6, 1) + p([no]) ∙ H(3, 4) = 7/14 ∙ 0.59 + 7/14 ∙ 0.98 = 0.78

HRating = p([fair]) ∙ H(6, 2) + p([exc]) ∙ H(3, 3) = 8/14 ∙ 0.81 + 6/14 ∙ 1 = 0.89

45 of 76

Attribute selection

Attribute A partitions S into S1, S2, ..., Sv

Entropy of attribute A is

HA = Σi p(Si) ∙ H(Si),  where p(Si) = |Si| / |S|

The information gain obtained by splitting S using A is

Gain(A) = HS − HA

Gain(Age) = 0.94 – 0.69 = 0.25

Gain(Income) = 0.94 – 0.91 = 0.03

Gain(Student) = 0.94 – 0.78 = 0.16

Gain(Rating) = 0.94 – 0.89 = 0.05

← split on age
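A small script reproducing these numbers (my own sketch, not the lecture's code; small differences to the slide come from rounding the intermediate entropies):

    import numpy as np

    def H(p, n):
        """Entropy of a set with p positive and n negative samples."""
        if p == 0 or n == 0:
            return 0.0
        fp, fn = p / (p + n), n / (p + n)
        return -fp * np.log2(fp) - fn * np.log2(fn)

    H_S = H(9, 5)                                      # 0.94

    def gain(partition):
        """partition: one (p, n) count pair per attribute value."""
        total = sum(p + n for p, n in partition)
        H_attr = sum((p + n) / total * H(p, n) for p, n in partition)
        return H_S - H_attr

    print(gain([(2, 3), (4, 0), (3, 2)]))   # Age     ~ 0.25
    print(gain([(2, 2), (4, 2), (3, 1)]))   # Income  ~ 0.03
    print(gain([(6, 1), (3, 4)]))           # Student ~ 0.15
    print(gain([(6, 2), (3, 3)]))           # Rating  ~ 0.05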

46 of 76

Pruning

The construction phase does not filter out noise → overfitting

Many possible pruning strategies

      • Stop partitioning a node when the corresponding number of samples assigned to a leaf goes below a threshold
      • Bottom-up pruning with a validation set: build the full tree, then replace nodes with leaves labeled with the majority class if classification accuracy on the validation set (!) does not get worse

47 of 76

Comments

Decision trees are just one example of a classification algorithm

      • Many others are out there (Naive Bayes, SVM, neural networks, logistic regression, kNN, random forests, ...)

Maybe not the best one ...

      • Sensitive to small perturbations in the data
      • Tend to overfit
      • Non-incremental: need to be re-trained from scratch when new training data becomes available

48 of 76

Decision Tree Models

  • As tree depth increases, bias decreases and variance generally increases. Why? (Hint: think about k-NN)

Bias decreases with tree depth

Variance increases with tree depth

49 of 76

Ensemble Methods

Are like crowdsourced machine learning algorithms:

  • Take a collection of simple or weak learners
  • Combine their results to make a single, better learner

Types:

  • Bagging: train learners in parallel on different samples of the data, then combine by voting (discrete output) or by averaging (continuous output).
  • Stacking: combine model outputs using a second-stage learner like linear regression.
  • Boosting: train learners sequentially, filtering/weighting samples based on the output of previous train/test runs.

50 of 76

Random Forests

Grow K trees on datasets sampled from the original dataset (size N) with replacement (bootstrap samples); let p be the number of features.

  • Draw K bootstrap samples of size N
  • Grow each decision tree by selecting a random subset of m out of the p features at each node and choosing the best of these features to split on.
  • Aggregate the predictions of the trees (majority vote, or average) to produce the final prediction (an example of bagging; see the sketch below).

Typically m is around sqrt(p), but it can be smaller.
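A sketch of this recipe with scikit-learn (illustrative; data assumed given as X, y):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=100,       # K trees, each grown on a bootstrap sample of size N
        max_features="sqrt",    # m ~ sqrt(p) random features considered at each split
        bootstrap=True,         # sample the training set with replacement
    )
    rf.fit(X, y)                # X, y assumed given
    print(rf.predict(X[:5]))    # majority vote over the K trees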

51 of 76

Random Forests

Principle: since we take a vote among different learners, we don’t want the models to be too similar. These two criteria ensure diversity in the individual trees:

  • Draw K bootstrap samples of size N:
    • Each tree is trained on different data.
  • Grow each decision tree by selecting a random set of m out of p features at each node, and choosing the best of these features to split on:
    • Corresponding nodes in different trees (usually) can’t use the same feature to split.

52 of 76

Random Forests

  • Very popular in practice, probably the most popular classifier for dense data (up to a few thousand features)
  • Easy to implement (simply train many normal decision trees)
  • Parallelizes easily

  • Needs many passes over the data – at least the max depth of the trees (<< boosted trees though, cf. next slide)

53 of 76

Boosted Decision Trees

  • A more recent alternative to Random Forests [good intro here]
  • In contrast to RFs, whose trees are trained independently, BDT trees are trained sequentially by boosting: each tree is trained to predict the error residuals of the previous trees (→ bias reduction).
  • Both RF and boosted trees can produce very high-quality models. Superiority of one method or the other is very dataset-dependent.
  • Resource requirements are very different as well, so it’s actually non-trivial to compare the methods.
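A minimal sketch of the residual-fitting idea for regression with squared loss (my own illustration, not the lecture's code; in practice one would use a library implementation such as XGBoost, LightGBM, or sklearn's gradient boosting):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # X, y assumed given: feature matrix and real-valued targets
    learning_rate, n_trees = 0.1, 200
    prediction = np.full(len(y), y.mean())           # start from a constant model
    trees = []

    for _ in range(n_trees):
        residuals = y - prediction                   # errors of the current ensemble
        t = DecisionTreeRegressor(max_depth=3)       # shallow tree (low variance)
        t.fit(X, residuals)                          # fit the residuals -> reduce bias
        prediction += learning_rate * t.predict(X)   # add a small correction
        trees.append(t)

    def boosted_predict(X_new):
        return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)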

54 of 76

Random Forests vs. Boosted Trees

  • The “geometry” of the methods is very different:
  • Random forests use 10’s of deep, large trees:

  • Depth 20-30
  • 10’s of trees
  • 100k’s of nodes
  • Bias reduction through depth
  • Variance reduction through the ensemble aggregate

55 of 76

Random Forests vs Boosted Trees

  • The “geometry” of the methods is very different:
  • Boosted decision trees use 1000’s of shallow, small trees:

  • Depth 4-8
  • 1000’s of trees
  • 100’s of nodes
  • Bias reduction through boosting – variance already low

56 of 76

Random Forests vs Boosted Trees

  • RF training embarrassingly parallel, can be very fast
  • Evaluation of trees (runtime) also much faster for RFs

  • Random forests: depth 20-30, 10’s of trees, 100k’s of nodes
  • Boosted trees: depth 4-8, 1000’s of trees, 100’s of nodes

57 of 76

On model transparency

DNNs are often considered “opaque” boxes

Do you think it’s easier to interpret a model with 1000s of trees?!

58 of 76

59 of 76

Linear and logistic regression

60 of 76

Linear Regression

We want to find the “best” line (linear function y=f(X)) to explain the data.


61 of 76

Linear Regression

 

62 of 76

Least Squares Solution

 

63 of 76

Least Squares Solution
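
The derivation on these slides is not reproduced here; as a sketch, the standard closed-form least-squares solution β = (XᵀX)⁻¹ Xᵀy can be computed with numpy as follows (toy data made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                          # toy features
    y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)

    X1 = np.hstack([np.ones((100, 1)), X])                 # add an intercept column
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)            # normal equations: (X'X) beta = X'y
    print(beta)                                            # ~ [1, 3, -2]

    beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)    # equivalent, more numerically stable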

64 of 76

How to model binary events?

  • E.g., X: student features; y: did student pass ADA?
  • Desired output: f(X) = probability of passing ADA, given feats X
  • Problem with linear regression: f(X) can be below 0 or above 1

65 of 76

Logistic regression

  • Trick: don’t deal with probabilities, which range from 0 to 1, but with log odds, which range from -inf to +inf
  • Probability y ⇔ odds y/(1-y) ⇔ log odds log[y/(1-y)]
  • Model log odds as a linear function of X


66 of 76

Logistic regression

  • Model log odds as a linear function of X
  • βTX = log[y/(1-y)]
  • Solve for y: y = 1 / (1 + exp(-βTX))
  • Finding best model β via maximum likelihood:
    • Don’t use square loss as in linear regression
    • Use cross-entropy loss instead

“sigmoid”
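A small sketch of the model in code (my own illustration; in practice β is fitted with a library routine such as sklearn.linear_model.LogisticRegression, which minimizes the cross-entropy loss):

    import numpy as np

    def sigmoid(z):
        # maps log odds in (-inf, +inf) back to probabilities in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(beta, X):
        # model: the log odds are linear in X, so P(y=1 | X) = sigmoid(beta^T X)
        return sigmoid(X @ beta)

    def cross_entropy(beta, X, y):
        # negative log-likelihood of 0/1 labels y under the model
        p = predict_proba(beta, X)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))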

67 of 76

Overfitting

  • The more features the better?
    • NO!
    • More features mean less bias, but more variance
    • Overfitting
  • Carefully selected features can improve model accuracy
    • E.g., keep features that correlate with the label y
    • Forward/backward feature selection
    • Regularization (e.g., penalize norm of weight vector)
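A sketch of the regularization option with scikit-learn (illustrative; data assumed given as X, y; C is the inverse regularization strength):

    from sklearn.linear_model import LogisticRegression

    # L2 regularization penalizes the norm of the weight vector:
    # smaller C = stronger penalty = simpler model (more bias, less variance)
    clf = LogisticRegression(penalty="l2", C=0.1)
    clf.fit(X, y)               # X, y assumed given
    print(clf.coef_)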

68 of 76

Feedback

Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

69 of 76

Classification

(Figure) A training set with categorical attributes (e.g., rank), a numerical attribute (e.g., years), and a class label (e.g., tenured) is fed to a classification algorithm, which produces a classifier (model), e.g., a rule such as:

IF rank = “professor” AND years > 6 THEN tenured = “yes”

70 of 76

Linear Regression

 

71 of 76

R2-values and P-values

We can always fit a linear model to any dataset, but how do we know if there is a real linear relationship?

72 of 76

R2-values

 

73 of 76

R-squared

 

(Figure: fitted line through data points in the X-y plane.)

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

The numerator (the sum of squared residuals) is small if the fit is good, so R² is close to 1 for a good fit.
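A quick check of this quantity in code (my own sketch):

    import numpy as np

    def r_squared(y, y_hat):
        ss_res = np.sum((y - y_hat) ** 2)       # small if the fit is good
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot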

74 of 76

R2-values and P-values

Statistic: From R-squared we can derive another statistic (using degrees of freedom) that has a standard distribution called an F-distribution.

From the CDF for the F-distribution, we can derive a P-value for the data.

The P-value is, as usual, the probability of observing the data under the null hypothesis of no linear relationship.

If p is small, say less than 0.05, we conclude that there is a linear relationship.
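A sketch of this computation (my own illustration, using the standard F-statistic for a regression with p predictors, an intercept, and n samples):

    from scipy.stats import f

    def regression_p_value(r2, n, p):
        """p = number of predictors (excluding intercept), n = number of samples."""
        F = (r2 / p) / ((1 - r2) / (n - p - 1))
        return f.sf(F, p, n - p - 1)    # probability of an F at least this large under H0

    print(regression_p_value(r2=0.65, n=100, p=2))   # toy numbers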

75 of 76

Ex. of correlation (and impact of each feature) explained via regression

76 of 76