1 of 73


Applied Data Analysis (CS401)

Maria Brbic

Lecture 8

Learning from data:

Applied machine learning

5 Nov 2025

2 of 73

Announcements

  • Homework H1: will be released today!
  • Project milestone P2 due today 23:59
  • Friday’s lab session:
    • Exercise on applied machine learning (Exercise7)


3 of 73

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 73

Why an extra class on applied ML?

[Diagram contrasting a classic ML class with ADA]

5 of 73

Classification pipeline

Data collection → Model selection → Model assessment

6 of 73

Data collection

The first step is collecting data related to the classification task.

    • Definition of the attributes (or features) that describe a data item and the class label.

Domain knowledge is needed.

What if assigning the class label is too time-consuming or even impossible?

    • Unsupervised methods (e.g., clustering); cf. next lecture!


7 of 73

Data collection

[Flowchart of the data-collection steps: class label available? (if not: data labeling) → identification of features → removing irrelevant features → discretization? (if yes: unsupervised/supervised discretization) → normalization? (if yes: standardization etc.)]

8 of 73

Features

Different types of features [more]

    • Continuous (e.g., height, temperature ...)
    • Ordinal (e.g., “agree”, “don’t care”, “disagree” ...)
    • Categorical (e.g., country, gender ...)

New features can be generated from simple stats

    • Feature engineering is something of an art, so it is often useful to look at what others have done for similar problems

Some classifiers require categorical features => discretization

9 of 73

ML before 2012*

(but still very common today)

* Before publication of Krizhevsky et al.’s ImageNet CNN paper

[Pipeline: input data → cleverly designed features → ML model]

Much of the “heavy lifting” happens in the feature design: the final performance is only as good as the feature set.

10 of 73

A typical ML approach after 2012

[Pipeline: input data → features and model learned together (deep learning)]

Features and model are learned together, mutually reinforcing each other.

EE-559: Deep learning

CS552: Modern natural language processing

CS-502: Deep learning in biomedicine

CS-456: Deep reinforcement learning

etc.

11 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

12 of 73

Labels

Collecting a lot of data (features) is often easy.

Labeling data is time consuming, difficult, and sometimes even impossible.

[Example: a webpage about diets. Label: “Is this page credible?” A human dietary expert is needed to answer.]

13 of 73

Potential labelers

  • You
  • Older days:
    • Undergraduate students
    • Domain experts ($$$)
  • Now: crowdsourcing
    • Can get both amateurs (~ undergrad students) and experts


14 of 73

[Diagram: crowdsourcing workflow. The requester (1) submits a task to the crowdsourcing platform (e.g., “Is this webpage credible?”); crowd workers (2) accept the task and (3) return their answers (credible, credible, not credible, credible, not credible); the requester (4) collects the answers.]

15 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

16 of 73

Discretization

Why?

  • Some classifiers want discrete features (e.g., simplest kinds of decision trees)
  • Discrete features let a linear classifier learn non-linear decision boundaries
  • Certain feature selection methods require discrete (or even binary) features


17 of 73

Discretization

Unsupervised

  • Equal width
    • Divide the range into a predefined number of bins (bad for skewed data, e.g., from a power law)
  • Equal frequency
    • Divide the range into a predefined number of bins so that every interval contains the same number of values
  • Clustering

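As a rough illustration of the two unsupervised schemes, here is a small pandas sketch; the Pareto toy data and the choice of 5 bins are assumptions for illustration, not from the slides:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    # Heavy-tailed toy feature, e.g., city sizes from a power-law-like distribution.
    sizes = pd.Series(rng.pareto(a=2.0, size=1_000) * 10_000)

    # Equal width: 5 bins spanning equal ranges (a few huge values stretch the bins).
    equal_width = pd.cut(sizes, bins=5)

    # Equal frequency: 5 bins, each holding roughly 200 values.
    equal_freq = pd.qcut(sizes, q=5)

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())

With skewed data like this, almost everything lands in the first equal-width bin while the top bins stay nearly empty; the equal-frequency bins remain balanced by construction.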

18 of 73

Discretization

Supervised

  • Start with very fine-grained discretization
  • Test the hypothesis that membership in two adjacent intervals of a feature is independent of the class
  • If they are independent, they should be merged
  • Otherwise they should remain separate
  • Independence test: χ2 test (“chi-squared test”) [example]
  • Continue recursively

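A minimal sketch of the merge test at the heart of this (ChiMerge-style) procedure, assuming we already have per-class counts for two adjacent intervals; the counts and the 0.05 significance threshold are made up for illustration:

    from scipy.stats import chi2_contingency

    # Class counts [positives, negatives] in two adjacent intervals of a feature.
    interval_a = [12, 9]
    interval_b = [11, 10]

    chi2_stat, p_value, dof, expected = chi2_contingency([interval_a, interval_b])

    # High p-value: no evidence that class membership differs between the
    # intervals, so they can be merged.
    merge = p_value > 0.05
    print(f"chi2={chi2_stat:.3f}, p={p_value:.3f}, merge={merge}")

In the full procedure this test is applied to every pair of adjacent intervals, and merging continues recursively as described above.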

19 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

20 of 73

Removing irrelevant features: Feature selection

  • Goal: reduce the set of N features to a subset of the M < N best ones
  • Why?
    • More interpretability
    • Less danger of overfitting
    • More efficient training
  • Problem: There are 2^N possible subsets
  • Solutions:
    • Filtering as preprocessing (“offline”)
    • Iterative feature selection (“online”)

21 of 73

Offline feature selection

Rank features according to their individual predictive power; then select the best ones

Pros:

      • Independent of the classifier (performed only once)

Cons:

      • Independent of the classifier (ignores interaction with the classifier)
      • Assumes features are independent

22 of 73

Ranking of features

Continuous features (and ideally labels):

    • Pearson’s correlation coefficient (capturing strength of linear [!] dependence)

Categorical features and labels:

    • Mutual information (goes beyond linear dependence)


23 of 73

Ranking of features

Categorical features and labels (cont’d):

    • χ2 method (“chi-squared”)
      • Similar to χ2 method for feature discretization
      • Test whether feature is independent of label
      • Difference w.r.t. mutual information: the χ2 test checks the independence of the class and the feature, without indicating the strength or direction of any existing relationship (you just get a significance, a.k.a. p-value)

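A sketch of such filter-style ranking with scikit-learn; the synthetic dataset and the choice of k = 2 are arbitrary, for illustration only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

    # Mutual information: works for continuous features, captures non-linear dependence.
    mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
    print("MI scores:", np.round(mi_selector.scores_, 3))

    # The chi-squared test needs non-negative (e.g., discretized/count-like) features.
    X_counts = np.abs(np.round(X))
    chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X_counts, y)
    print("chi2 p-values:", np.round(chi2_selector.pvalues_, 3))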

24 of 73

Ranking of features

Beware: collectively relevant features may look individually irrelevant!


25 of 73

Ranking of features

Beware: collectively relevant features may look individually irrelevant!


26 of 73

Online feature selection

Forward feature selection: greedily add features; evaluate on validation dataset; stop when no improvement

Pros

      • Interacts with the classifier
      • No feature-independence assumption

Cons

      • Computationally intensive

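A sketch of greedy forward selection with scikit-learn's SequentialFeatureSelector; the dataset, the logistic-regression classifier, and the tolerance are illustrative choices, and the "auto" + tol stopping rule needs scikit-learn >= 1.1:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Greedily add the feature that most improves cross-validated accuracy and
    # stop when the improvement drops below tol; direction="backward" gives
    # backward elimination instead.
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=5000),
        n_features_to_select="auto",
        tol=1e-3,
        direction="forward",
        cv=5,
    )
    selector.fit(X, y)
    print("Selected feature indices:", selector.get_support(indices=True))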

27 of 73

Online feature selection

Backward feature selection: greedily remove features; evaluate on validation dataset; stop when no improvement

Pros

      • Interacts with the classifier
      • No feature-independence assumption

Cons

      • Computationally intensive


28 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

29 of 73

Feature normalization

  • Some classifiers can’t easily handle features with very different scales, e.g.,
    • Revenue in CHF: 10,000,000
    • # of employees: 300
  • Features with large values dominate the others, and the classifier tends to over-optimize for them
  • Even a single feature may span many orders of magnitude
    • e.g., city size (most cities small, some huge)
    • Too little resolution where data is dense, too much resolution where data is sparse

30 of 73

Logarithmic scaling

xi’ = log(xi)

  • Consider order of magnitude, rather than direct value
  • Good for heavy-tailed features (e.g., from power laws)


31 of 73

Min-max scaling

xi’ = (xi – mi)/(Mi – mi)

where Mi and mi are the max and min values of feature xi respectively

The new feature xi’ lies in the interval [0,1]


32 of 73

Standardization

xi’ = (xi – μi)/σi

where μi is the mean value of feature xi, and σi is the standard deviation

The new feature xi’ has mean 0 and standard deviation 1

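A small sketch contrasting the three transformations on a single heavy-tailed toy feature (the values are made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # One heavy-tailed feature as a column vector, e.g., city sizes.
    x = np.array([[1e2], [5e2], [1e3], [2e4], [3e6]])

    x_log = np.log(x)                           # logarithmic scaling
    x_minmax = MinMaxScaler().fit_transform(x)  # maps to [0, 1]
    x_std = StandardScaler().fit_transform(x)   # mean 0, std 1

    print(np.hstack([x_log, x_minmax, x_std]).round(3))

In practice, fit the scaler on the training set only and apply it unchanged to the validation and test sets, to avoid leaking information.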

33 of 73

Dangers of standardization and scaling

Standardization:

  • Assumes that the data has been generated by a Gaussian
  • Uses mean and std → not meaningful for heavy-tailed data (may be mitigated by log-scaling)

Min-max scaling:

  • If the data has outliers, they squeeze the typical values into a very small interval


34 of 73


Commercial break

35 of 73

Classification pipeline

Data collection → Model selection → Model assessment

36 of 73

Model selection: high level

Need to choose type of model

  • k-NN?
  • Decision trees?
  • Random forest?
  • Boosted decision trees?
  • Logistic regression?
  • Deep learning?


37 of 73

Model selection: low level

Usually a classifier has some “hyperparameters” to be tuned

    • Set of features to include
    • Distance function (e.g., k-NN)
    • Number of neighbors (e.g., k-NN)
    • Number of trees (e.g., random forest)
    • Decision threshold (e.g., logistic regression)
    • Regularization parameter
    • Learning rate (for gradient descent algorithms)


38 of 73

Loss function (more of them later!)

Categorical output

    • e.g., 0-1 loss function, risk (= 1 minus accuracy): L(y, ŷ) = 1 if ŷ ≠ y, else 0

Real-valued output

    • e.g., squared error: L(y, ŷ) = (y – ŷ)²

    • e.g., absolute error: L(y, ŷ) = |y – ŷ|

39 of 73

Model selection: on what data to evaluate?

What if you can’t afford a 3-way split because you have too little data?

Cross-validation! (p.t.o.)

[Diagram: the full dataset split into training, validation, and test sets]

40 of 73

Cross-validation


  • Last lecture: leave-one-out cross-validation == N-fold cross-validation (where N is #data points)
  • More efficient: m-fold cross-validation (in above picture: m = 5)
  • Average performance over the m red portions → validation error
  • Repeat procedure for all candidate models and pick the one with the lowest validation error
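
Putting the last few slides together, a sketch of picking a hyperparameter by 5-fold cross-validation in scikit-learn; the k-NN model, the candidate grid, and the dataset are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # 5-fold cross-validation over candidate values of k (the number of neighbors).
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
        cv=5,
        scoring="f1",
    )
    search.fit(X, y)
    print("Best k:", search.best_params_, "validation F1:", round(search.best_score_, 3))

The selected model should then still be assessed on a held-out test set that was never touched during selection (see model assessment below).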

41 of 73

Model selection

[Plot: validation error as a function of the hyperparameter value; choose the value with the lowest validation error]

42 of 73

Performance metrics for binary classification

For categorical binary classification, the usual metrics are based on the confusion matrix, which has 4 values:

    • True Positives (positive examples classified as positive)
    • True Negatives (negative examples classified as negative)
    • False Positives (negative examples classified as positive)
    • False Negatives (positive examples classified as negative)

                    Class Pos   Class Neg
  Classified Pos       TP          FP
  Classified Neg       FN          TN

43 of 73

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Appropriate metric when

    • classes are not skewed
    • errors (FP, FN) have the same importance


44 of 73

Accuracy (skewed example)

Classifier 1 (accuracy = 85/100 = 85%):
                        Class Fraud   Class ¬Fraud
  Classified Fraud           5             10
  Classified ¬Fraud          5             80

“Always ¬Fraud” (accuracy = 90/100 = 90%):
                        Class Fraud   Class ¬Fraud
  Classified Fraud           0              0
  Classified ¬Fraud         10             90

45 of 73

Poll time

Which classifier is better?

  • Classifier 1
  • Classifier 2
  • Both are equally good

Classifier 1 (100 data points):
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2 (100 data points):
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

POLLING TIME
  • Scan QR code or go to https://go.epfl.ch/ada2025-lec8-poll

46 of 73

Poll time

Which classifier is better?

    • Classifier 1
    • Classifier 2
    • Both are equally good

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

47 of 73

Precision and recall

Precision: what fraction of positive predictions are actually positive?
Precision = TP / (TP + FP)

Recall: what fraction of actually positive examples did I recognize as such?
Recall = TP / (TP + FN)

48 of 73

Precision and recall

P1 = 45/65 = 0.69    R1 = 45/50 = 0.9
P2 = 40/50 = 0.8     R2 = 40/50 = 0.8

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

“Everybody has cancer”:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           50              50
  Classified ¬Cancer           0               0

P = 50/100 = 0.5    R = 50/50 = 1

49 of 73

F-score

Sometimes it’s necessary to have a single metric to compare classifiers

F-score (or F1-score): harmonic mean of precision and recall

Precision and recall can be differently weighted, if one is more important than the other

F1 = 1 / (0.5 * (1/P + 1/R)) = 2PR / (P + R)
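
A tiny sketch that reproduces the numbers from the cancer example on the surrounding slides (the helper function is just for illustration):

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall, and F1 computed from confusion-matrix counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return round(precision, 2), round(recall, 2), round(f1, 2)

    print(precision_recall_f1(tp=45, fp=20, fn=5))    # classifier 1: (0.69, 0.9, 0.78)
    print(precision_recall_f1(tp=40, fp=10, fn=10))   # classifier 2: (0.8, 0.8, 0.8)
    print(precision_recall_f1(tp=50, fp=50, fn=0))    # "everybody has cancer": (0.5, 1.0, 0.67)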

50 of 73

Precision and recall

F1 = 2*(0.69*0.9)/(0.69+0.9) = 0.78
F2 = 2*(0.8*0.8)/(0.8+0.8) = 0.8

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

“Everybody has cancer”:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           50              50
  Classified ¬Cancer           0               0

F = 2*(0.5*1)/(0.5+1) = 0.66

51 of 73

Precision/recall curve

[Plot: precision/recall curve, traced out by decreasing the classification threshold]

52 of 73

ROC curve

ROC = Receiver-Operating Characteristic (WTF?!)

Y-axis: true-positive rate = TP/(TP + FN), a.k.a. recall

X-axis: false-positive rate = FP/(FP + TN)

[Plot: ROC curve (true-positive rate vs. false-positive rate), traced out by decreasing the classification threshold]

53 of 73

ROC AUC

ROC AUC is the “area under the curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).

[Plot: ROC curve compared with the diagonal of a random ordering, which has area = 0.5]
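
A sketch of computing the ROC curve and its AUC with scikit-learn (the labels and scores below are toy values):

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # true labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # classifier scores

    # One (false-positive rate, true-positive rate) point per threshold.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

The AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.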

54 of 73

Bias and variance


55 of 73

[Plot: validation error as a function of model complexity]

56 of 73

How to know where on the x-axis you are (without fiddling with model complexity)?

[Same plot: validation error as a function of model complexity]

57 of 73

Bias and variance

Bias and variance can be assessed by comparing the error metric on the training set and the validation set => always plot learning curves (training set size vs. training/validation errors)

[Plot: learning curves, training and validation error as a function of training-set size]
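
A sketch of producing such learning curves with scikit-learn (the dataset and the logistic-regression model are illustrative choices):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)

    train_sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
    )
    # A large, persistent gap between the two curves suggests high variance;
    # two curves converging to a poor score suggest high bias.
    print("training accuracy:  ", train_scores.mean(axis=1).round(3))
    print("validation accuracy:", val_scores.mean(axis=1).round(3))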

58 of 73

When more data helps

High bias: more data doesn’t help.
High variance: more data helps.

Fixed data set size, varying model complexity vs. fixed model complexity, varying data set size.

[Two learning-curve plots, each showing training and validation error.]

For curious ADAventurers: “Reconciling modern machine-learning practice and the classical bias–variance trade-off”
https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

59 of 73

Classification pipeline

Data collection → Model selection → Model assessment

60 of 73

Model assessment

  • Model assessment is the task of estimating the performance of a fixed model (i.e., the best model found during model selection)
  • Ideally under real-world conditions
  • Use held-out test set that you’ve never seen during training


61 of 73

Useful reads


62 of 73


63 of 73

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

64 of 73

Crowdsourcing

Different types of workers

    • Truthful
      • Expert
      • Normal
    • Untruthful
      • Uniform spammer
      • Random spammer
      • Malicious spammer (a.k.a. a$s#*le)

[Scatter plot positioning worker types by true-positive rate and true-negative rate: expert, normal worker, random spammer, uniform spammer, malicious spammer]

65 of 73

Catching malicious spammers

  • Insert obvious examples for which you already know the labels (“honeypot”)
    • Tell workers they won’t be paid if they don’t get those right
    • Filter out workers who don’t get them right
  • Aggregate multiple answers
    • p.t.o.


66 of 73

Crowdsourcing

Answer aggregation problem

  • Have each example labeled by several workers, aggregate:
    • e.g., majority vote (works if only a minority is malicious)
    • e.g., “peer prediction”, “prediction markets” (game theory: workers are paid more if they agree with others)

Crowd (workers):

  Worker   Webpage         Credible?
  W1       www.diet.com    C
  W2       www.diet.com    ¬C
  W3       www.diet.com    C
  ...      ...             ...

Aggregation:  www.diet.com → C
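
A minimal majority-vote sketch over a table of worker answers like the one above; the second webpage and its answers are made-up extra rows for illustration:

    import pandas as pd

    # Worker answers; www.example.org and its labels are hypothetical extra rows.
    answers = pd.DataFrame({
        "worker":  ["W1", "W2", "W3", "W1", "W2", "W3"],
        "webpage": ["www.diet.com"] * 3 + ["www.example.org"] * 3,
        "label":   ["C", "¬C", "C", "¬C", "¬C", "C"],
    })

    # Majority vote per webpage (ties would need an explicit tie-breaking rule).
    aggregated = answers.groupby("webpage")["label"].agg(lambda s: s.mode().iloc[0])
    print(aggregated)   # www.diet.com -> C, www.example.org -> ¬C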

67 of 73

Recap

  • Model selection:
    • Use training data in cross-validation
    • Need evaluation metric
      • Typically based on confusion matrix
      • e.g., accuracy, precision, recall

Data collection → Model selection → Model assessment

                  Class A   Class B
  Classified A       TP        FP
  Classified B       FN        TN

68 of 73

Model selection

Need evaluation metric!

[Flowchart: split dataset into “training” and “validation”; set classifier parameters; train classifier with the training set; evaluate classifier with the validation set; performance acceptable? If not, go back and adjust the parameters.]

69 of 73

Fwd-selected features vs. performance


70 of 73

Training and testing with heaps of data

[Diagram: database D is split into a training set (60% of D), used to learn the model, and a test set (40% of D), used to evaluate the model and produce a performance metric.]

71 of 73

Data-efficient training and testing:

Leave-one-out cross-validation

[Diagram: database D is split into a training set ((N–1)/N of D) used to learn the model and a test set (1/N of D) used to evaluate it; repeat N times and average the N runs.]

72 of 73

Data-efficient training and testing:

k-fold cross validation

[Diagram: database D is split into a training set ((k–1)/k of D) used to learn the model and a test set (1/k of D) used to evaluate it; repeat k times and average the k runs.]
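
A sketch of the k-fold loop above in scikit-learn, with k = 5 (the dataset and the model are illustrative choices):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X, y = load_breast_cancer(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, test_idx in kf.split(X):
        # Learn the model on (k-1)/k of D, evaluate on the remaining 1/k.
        model = LogisticRegression(max_iter=5000).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print("per-fold accuracy:", np.round(scores, 3), "average:", round(np.mean(scores), 3))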

73 of 73

More data often beats better algorithms
