1 of 73


Applied Data Analysis (CS401)

Maria Brbic

Lecture 8

Learning from data:

Applied machine learning

5 Nov 2025

2 of 73

Announcements

  • Homework H1: will be released today!
  • Project milestone P2 due today 23:59
  • Friday’s lab session:
    • Exercise on applied machine learning (Exercise7)


3 of 73

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 73

Why an extra class on applied ML?

[Diagram contrasting a classic ML class with ADA]

5 of 73

Classification pipeline

Data collection → Model selection → Model assessment

6 of 73

Data collection

The first step is collecting data related to the classification task.

    • Definition of the attributes (or features) that describe a data item and the class label.

Domain knowledge is needed.

What if assigning the class label is too time-consuming or even impossible?

    • Unsupervised methods (e.g., clustering); cf. next lecture!


7 of 73

Data collection

[Flowchart of the data-collection steps: class label available? (if not: data labeling) → identification of features → removing irrelevant features → discretization? (if yes: unsupervised/supervised discretization) → normalization? (if yes: standardization etc.)]

8 of 73

Features

Different types of features [more]

    • Continuous (e.g., height, temperature ...)
    • Ordinal (e.g., “agree”, “don’t care”, “disagree” ...)
    • Categorical (e.g., country, gender ...)

New features can be generated from simple stats

    • Feature engineering is something of an art, so it is often useful to look at what others have done for similar problems

Some classifiers require categorical features => discretization

9 of 73

ML before 2012*

(but still very common today)

* Before publication of Krizhevsky et al.’s ImageNet CNN paper

[Pipeline: input data → cleverly designed features → ML model]

Much of the “heavy lifting” happens in the feature design: the final performance is only as good as the feature set.

10 of 73

A typical ML approach after 2012

[Pipeline: input data → features and model learned together (deep learning)]

Features and model are learned together, mutually reinforcing each other.

EE-559: Deep learning

CS552: Modern natural language processing

CS-502: Deep learning in biomedicine

CS-456: Deep reinforcement learning

etc.

11 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

12 of 73

Labels

Collecting a lot of data (features) is often easy.

Labeling data is time consuming, difficult, and sometimes even impossible.

[Example: a webpage about diets. Label: “Is this page credible?” A human dietary expert is needed to answer.]

13 of 73

Potential labelers

  • You
  • Older days:
    • Undergraduate students
    • Domain experts ($$$)
  • Now: crowdsourcing
    • Can get both amateurs (~ undergrad students) and experts


14 of 73

[Diagram: crowdsourcing workflow. The requester (1) submits a task to the crowdsourcing platform (e.g., “Is this webpage credible?”); crowd workers (2) accept the task and (3) return their answers (credible, credible, not credible, credible, not credible); the requester (4) collects the answers.]

15 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

16 of 73

Discretization

Why?

  • Some classifiers want discrete features (e.g., simplest kinds of decision trees)
  • Discrete features let a linear classifier learn non-linear decision boundaries
  • Certain feature selection methods require discrete (or even binary) features


17 of 73

Discretization

Unsupervised

  • Equal width
    • Divide the range into a predefined number of bins (bad for skewed data, e.g., from a power law)
  • Equal frequency
    • Divide the range into a predefined number of bins so that every interval contains the same number of values
  • Clustering

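As a rough illustration of the two unsupervised schemes, here is a small pandas sketch; the Pareto toy data and the choice of 5 bins are assumptions for illustration, not from the slides:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    # Heavy-tailed toy feature, e.g., city sizes from a power-law-like distribution.
    sizes = pd.Series(rng.pareto(a=2.0, size=1_000) * 10_000)

    # Equal width: 5 bins spanning equal ranges (a few huge values stretch the bins).
    equal_width = pd.cut(sizes, bins=5)

    # Equal frequency: 5 bins, each holding roughly 200 values.
    equal_freq = pd.qcut(sizes, q=5)

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())

With skewed data like this, almost everything lands in the first equal-width bin while the top bins stay nearly empty; the equal-frequency bins remain balanced by construction.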

18 of 73

Discretization

Supervised

  • Start with very fine-grained discretization
  • Test the hypothesis that membership in two adjacent intervals of a feature is independent of the class
  • If they are independent, they should be merged
  • Otherwise they should remain separate
  • Independence test: χ2 test (“chi-squared test”) [example]
  • Continue recursively

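A minimal sketch of the merge test at the heart of this (ChiMerge-style) procedure, assuming we already have per-class counts for two adjacent intervals; the counts and the 0.05 significance threshold are made up for illustration:

    from scipy.stats import chi2_contingency

    # Class counts [positives, negatives] in two adjacent intervals of a feature.
    interval_a = [12, 9]
    interval_b = [11, 10]

    chi2_stat, p_value, dof, expected = chi2_contingency([interval_a, interval_b])

    # High p-value: no evidence that class membership differs between the
    # intervals, so they can be merged.
    merge = p_value > 0.05
    print(f"chi2={chi2_stat:.3f}, p={p_value:.3f}, merge={merge}")

In the full procedure this test is applied to every pair of adjacent intervals, and merging continues recursively as described above.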

19 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

20 of 73

Removing irrelevant features: Feature selection

  • Goal: reduce the set of N features to a subset of the M < N best ones
  • Why?
    • More interpretability
    • Less danger of overfitting
    • More efficient training
  • Problem: There are 2^N possible subsets
  • Solutions:
    • Filtering as preprocessing (“offline”)
    • Iterative feature selection (“online”)

21 of 73

Offline feature selection

Rank features according to their individual predictive power; then select the best ones

Pros:

      • Independent of the classifier (performed only once)

Cons:

      • Independent of the classifier (ignores interaction with the classifier)
      • Assumes features are independent

22 of 73

Ranking of features

Continuous features (and ideally labels):

    • Pearson’s correlation coefficient (capturing strength of linear [!] dependence)

Categorical features and labels:

    • Mutual information (goes beyond linear dependence)


23 of 73

Ranking of features

Categorical features and labels (cont’d):

    • χ2 method (“chi-squared”)
      • Similar to χ2 method for feature discretization
      • Test whether feature is independent of label
      • Difference w.r.t. mutual information: the χ2 test checks the independence of the class and the feature, without indicating the strength or direction of any existing relationship (you just get a significance, a.k.a. p-value)

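A sketch of such filter-style ranking with scikit-learn; the synthetic dataset and the choice of k = 2 are arbitrary, for illustration only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

    # Mutual information: works for continuous features, captures non-linear dependence.
    mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
    print("MI scores:", np.round(mi_selector.scores_, 3))

    # The chi-squared test needs non-negative (e.g., discretized/count-like) features.
    X_counts = np.abs(np.round(X))
    chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X_counts, y)
    print("chi2 p-values:", np.round(chi2_selector.pvalues_, 3))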

24 of 73

Ranking of features

Beware: collectively relevant features may look individually irrelevant!


25 of 73

Ranking of features

Beware: collectively relevant features may look individually irrelevant!


26 of 73

Online feature selection

Forward feature selection: greedily add features; evaluate on validation dataset; stop when no improvement

Pros

      • Interacts with the classifier
      • No feature-independence assumption

Cons

      • Computationally intensive

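A sketch of greedy forward selection with scikit-learn's SequentialFeatureSelector; the dataset, the logistic-regression classifier, and the tolerance are illustrative choices, and the "auto" + tol stopping rule needs scikit-learn >= 1.1:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Greedily add the feature that most improves cross-validated accuracy and
    # stop when the improvement drops below tol; direction="backward" gives
    # backward elimination instead.
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=5000),
        n_features_to_select="auto",
        tol=1e-3,
        direction="forward",
        cv=5,
    )
    selector.fit(X, y)
    print("Selected feature indices:", selector.get_support(indices=True))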

27 of 73

Online feature selection

Backward feature selection: greedily remove features; evaluate on validation dataset; stop when no improvement

Pros

      • Interacts with the classifier
      • No feature-independence assumption

Cons

      • Computationally intensive


28 of 73

Data collection

[Data-collection flowchart repeated from slide 7]

29 of 73

Feature normalization

  • Some classifiers can’t easily handle features with very different scales, e.g.,
    • Revenue in CHF: 10,000,000
    • # of employees: 300
  • Features with large values dominate the others, and the classifier tends to over-optimize for them
  • Even a single feature may span many orders of magnitude
    • e.g., city size (most cities small, some huge)
    • Too little resolution where data is dense, too much resolution where data is sparse

30 of 73

Logarithmic scaling

xi’ = log(xi)

  • Consider order of magnitude, rather than direct value
  • Good for heavy-tailed features (e.g., from power laws)


31 of 73

Min-max scaling

xi’ = (xi – mi)/(Mi – mi)

where Mi and mi are the max and min values of feature xi respectively

The new feature xi’ lies in the interval [0,1]


32 of 73

Standardization

xi’ = (xi – μi)/σi

where μi is the mean value of feature xi, and σi is the standard deviation

The new feature xi’ has mean 0 and standard deviation 1

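A small sketch contrasting the three transformations on a single heavy-tailed toy feature (the values are made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # One heavy-tailed feature as a column vector, e.g., city sizes.
    x = np.array([[1e2], [5e2], [1e3], [2e4], [3e6]])

    x_log = np.log(x)                           # logarithmic scaling
    x_minmax = MinMaxScaler().fit_transform(x)  # maps to [0, 1]
    x_std = StandardScaler().fit_transform(x)   # mean 0, std 1

    print(np.hstack([x_log, x_minmax, x_std]).round(3))

In practice, fit the scaler on the training set only and apply it unchanged to the validation and test sets, to avoid leaking information.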

33 of 73

Dangers of standardization and scaling

Standardization:

  • Assumes that the data has been generated by a Gaussian
  • Uses mean and std → not meaningful for heavy-tailed data (may be mitigated by log-scaling)

Min-max scaling:

  • If the data has outliers, they squeeze the typical values into a very small interval


34 of 73


Commercial break

35 of 73

Classification pipeline

Data collection → Model selection → Model assessment

36 of 73

Model selection: high level

Need to choose type of model

  • k-NN?
  • Decision trees?
  • Random forest?
  • Boosted decision trees?
  • Logistic regression?
  • Deep learning?


37 of 73

Model selection: low level

Usually a classifier has some “hyperparameters” to be tuned

    • Set of features to include
    • Distance function (e.g., k-NN)
    • Number of neighbors (e.g., k-NN)
    • Number of trees (e.g., random forest)
    • Decision threshold (e.g., logistic regression)
    • Regularization parameter
    • Learning rate (for gradient descent algorithms)


38 of 73

Loss function (more of them later!)

Categorical output

    • e.g., 0-1 loss function, risk (= 1 minus accuracy): L(y, ŷ) = 1 if ŷ ≠ y, else 0

Real-valued output

    • e.g., squared error: L(y, ŷ) = (y – ŷ)²

    • e.g., absolute error: L(y, ŷ) = |y – ŷ|

39 of 73

Model selection: on what data to evaluate?

What if you can’t afford a 3-way split because you have too little data?

Cross-validation! (p.t.o.)

[Diagram: the full dataset split into training, validation, and test sets]

40 of 73

Cross-validation


  • Last lecture: leave-one-out cross-validation == N-fold cross-validation (where N is #data points)
  • More efficient: m-fold cross-validation (in above picture: m = 5)
  • Average performance over the m red portions → validation error
  • Repeat procedure for all candidate models and pick the one with the lowest validation error
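
Putting the last few slides together, a sketch of picking a hyperparameter by 5-fold cross-validation in scikit-learn; the k-NN model, the candidate grid, and the dataset are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # 5-fold cross-validation over candidate values of k (the number of neighbors).
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
        cv=5,
        scoring="f1",
    )
    search.fit(X, y)
    print("Best k:", search.best_params_, "validation F1:", round(search.best_score_, 3))

The selected model should then still be assessed on a held-out test set that was never touched during selection (see model assessment below).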

41 of 73

Model selection

[Plot: validation error as a function of the hyperparameter value; choose the value with the lowest validation error]

42 of 73

Performance metrics for binary classification

For categorical binary classification, the usual metrics are based on the confusion matrix, which has 4 values:

    • True Positives (positive examples classified as positive)
    • True Negatives (negative examples classified as negative)
    • False Positives (negative examples classified as positive)
    • False Negatives (positive examples classified as negative)

                    Class Pos   Class Neg
  Classified Pos       TP          FP
  Classified Neg       FN          TN

43 of 73

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Appropriate metric when

    • classes are not skewed
    • errors (FP, FN) have the same importance


44 of 73

Accuracy (skewed example)

Classifier 1 (accuracy = 85/100 = 85%):
                        Class Fraud   Class ¬Fraud
  Classified Fraud           5             10
  Classified ¬Fraud          5             80

“Always ¬Fraud” (accuracy = 90/100 = 90%):
                        Class Fraud   Class ¬Fraud
  Classified Fraud           0              0
  Classified ¬Fraud         10             90

45 of 73

Poll time

Which classifier is better?

  • Classifier 1
  • Classifier 2
  • Both are equally good

Classifier 1 (100 data points):
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2 (100 data points):
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

POLLING TIME
  • Scan QR code or go to https://go.epfl.ch/ada2025-lec8-poll

46 of 73

Poll time

Which classifier is better?

    • Classifier 1
    • Classifier 2
    • Both are equally good

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

47 of 73

Precision and recall

Precision: what fraction of positive predictions are actually positive?
Precision = TP / (TP + FP)

Recall: what fraction of actually positive examples did I recognize as such?
Recall = TP / (TP + FN)

48 of 73

Precision and recall

P1 = 45/65 = 0.69    R1 = 45/50 = 0.9
P2 = 40/50 = 0.8     R2 = 40/50 = 0.8

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

“Everybody has cancer”:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           50              50
  Classified ¬Cancer           0               0

P = 50/100 = 0.5    R = 50/50 = 1

49 of 73

F-score

Sometimes it’s necessary to have a single metric to compare classifiers

F-score (or F1-score): harmonic mean of precision and recall

Precision and recall can be differently weighted, if one is more important than the other

F1 = 1 / (0.5 * (1/P + 1/R)) = 2PR / (P + R)
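
A tiny sketch that reproduces the numbers from the cancer example on the surrounding slides (the helper function is just for illustration):

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall, and F1 computed from confusion-matrix counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return round(precision, 2), round(recall, 2), round(f1, 2)

    print(precision_recall_f1(tp=45, fp=20, fn=5))    # classifier 1: (0.69, 0.9, 0.78)
    print(precision_recall_f1(tp=40, fp=10, fn=10))   # classifier 2: (0.8, 0.8, 0.8)
    print(precision_recall_f1(tp=50, fp=50, fn=0))    # "everybody has cancer": (0.5, 1.0, 0.67)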

50 of 73

Precision and recall

F1 = 2*(0.69*0.9)/(0.69+0.9) = 0.78
F2 = 2*(0.8*0.8)/(0.8+0.8) = 0.8

Classifier 1:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           45              20
  Classified ¬Cancer           5              30

Classifier 2:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           40              10
  Classified ¬Cancer          10              40

“Everybody has cancer”:
                         Class Cancer   Class ¬Cancer
  Classified Cancer           50              50
  Classified ¬Cancer           0               0

F = 2*(0.5*1)/(0.5+1) = 0.66

51 of 73

Precision/recall curve

[Plot: precision/recall curve, traced out by decreasing the classification threshold]

52 of 73

ROC curve

ROC = Receiver-Operating Characteristic (WTF?!)

Y-axis: true-positive rate = TP/(TP + FN), a.k.a. recall

X-axis: false-positive rate = FP/(FP + TN)

[Plot: ROC curve (true-positive rate vs. false-positive rate), traced out by decreasing the classification threshold]

53 of 73

ROC AUC

ROC AUC is the “area under the curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).

[Plot: ROC curve compared with the diagonal of a random ordering, which has area = 0.5]
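
A sketch of computing the ROC curve and its AUC with scikit-learn (the labels and scores below are toy values):

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # true labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # classifier scores

    # One (false-positive rate, true-positive rate) point per threshold.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

The AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.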

54 of 73

Bias and variance


55 of 73

[Plot: validation error as a function of model complexity]

56 of 73

How to know where on the x-axis you are (without fiddling with model complexity)?

[Same plot: validation error as a function of model complexity]

57 of 73

Bias and variance

Bias and variance can be assessed by comparing the error metric on the training set and the validation set => always plot learning curves (training set size vs. training/validation errors)

[Plot: learning curves, training and validation error as a function of training-set size]
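
A sketch of producing such learning curves with scikit-learn (the dataset and the logistic-regression model are illustrative choices):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)

    train_sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
    )
    # A large, persistent gap between the two curves suggests high variance;
    # two curves converging to a poor score suggest high bias.
    print("training accuracy:  ", train_scores.mean(axis=1).round(3))
    print("validation accuracy:", val_scores.mean(axis=1).round(3))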

58 of 73

When more data helps

High bias: more data doesn’t help.
High variance: more data helps.

Fixed data set size, varying model complexity vs. fixed model complexity, varying data set size.

[Two learning-curve plots, each showing training and validation error.]

For curious ADAventurers: “Reconciling modern machine-learning practice and the classical bias–variance trade-off”
https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

59 of 73

Classification pipeline

Data collection → Model selection → Model assessment

60 of 73

Model assessment

  • Model assessment is the task of estimating the performance of a fixed model (i.e., the best model found during model selection)
  • Ideally under real-world conditions
  • Use held-out test set that you’ve never seen during training


61 of 73

Useful reads


62 of 73


63 of 73

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

64 of 73

Crowdsourcing

Different types of workers

    • Truthful
      • Expert
      • Normal
    • Untruthful
      • Uniform spammer
      • Random spammer
      • Malicious spammer (a.k.a. a$s#*le)

[Scatter plot positioning worker types by true-positive rate and true-negative rate: expert, normal worker, random spammer, uniform spammer, malicious spammer]

65 of 73

Catching malicious spammers

  • Insert obvious examples for which you already know the labels (“honeypot”)
    • Tell workers they won’t be paid if they don’t get those right
    • Filter out workers who don’t get them right
  • Aggregate multiple answers
    • p.t.o.


66 of 73

Crowdsourcing

Answer aggregation problem

  • Have each example labeled by several workers, aggregate:
    • e.g., majority vote (works if only a minority is malicious)
    • e.g., “peer prediction”, “prediction markets” (game theory: workers are paid more if they agree with others)

Crowd (workers):

  Worker   Webpage         Credible?
  W1       www.diet.com    C
  W2       www.diet.com    ¬C
  W3       www.diet.com    C
  ...      ...             ...

Aggregation:  www.diet.com → C
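
A minimal majority-vote sketch over a table of worker answers like the one above; the second webpage and its answers are made-up extra rows for illustration:

    import pandas as pd

    # Worker answers; www.example.org and its labels are hypothetical extra rows.
    answers = pd.DataFrame({
        "worker":  ["W1", "W2", "W3", "W1", "W2", "W3"],
        "webpage": ["www.diet.com"] * 3 + ["www.example.org"] * 3,
        "label":   ["C", "¬C", "C", "¬C", "¬C", "C"],
    })

    # Majority vote per webpage (ties would need an explicit tie-breaking rule).
    aggregated = answers.groupby("webpage")["label"].agg(lambda s: s.mode().iloc[0])
    print(aggregated)   # www.diet.com -> C, www.example.org -> ¬C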

67 of 73

Recap

  • Model selection:
    • Use training data in cross-validation
    • Need evaluation metric
      • Typically based on confusion matrix
      • e.g., accuracy, precision, recall

Data collection → Model selection → Model assessment

                  Class A   Class B
  Classified A       TP        FP
  Classified B       FN        TN

68 of 73

Model selection

Need evaluation metric!

[Flowchart: split dataset into “training” and “validation”; set classifier parameters; train classifier with the training set; evaluate classifier with the validation set; performance acceptable? If not, go back and adjust the parameters.]

69 of 73

Fwd-selected features vs. performance


70 of 73

Training and testing with heaps of data

[Diagram: database D is split into a training set (60% of D), used to learn the model, and a test set (40% of D), used to evaluate the model and produce a performance metric.]

71 of 73

Data-efficient training and testing:

Leave-one-out cross-validation

[Diagram: database D is split into a training set ((N–1)/N of D) used to learn the model and a test set (1/N of D) used to evaluate it; repeat N times and average the N runs.]

72 of 73

Data-efficient training and testing:

k-fold cross validation

[Diagram: database D is split into a training set ((k–1)/k of D) used to learn the model and a test set (1/k of D) used to evaluate it; repeat k times and average the k runs.]
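
A sketch of the k-fold loop above in scikit-learn, with k = 5 (the dataset and the model are illustrative choices):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X, y = load_breast_cancer(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, test_idx in kf.split(X):
        # Learn the model on (k-1)/k of D, evaluate on the remaining 1/k.
        model = LogisticRegression(max_iter=5000).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print("per-fold accuracy:", np.round(scores, 3), "average:", round(np.mean(scores), 3))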

73 of 73

More data often beats better algorithms
