1
Applied Data Analysis (CS401)
Maria Brbic
Lecture 8
Learning from data:
Applied machine learning
5 Nov 2025
Announcements
2
Feedback
3
Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback
Why an extra class on applied ML?
4
[Diagram: what a classic ML class covers vs. what ADA covers]
Classification pipeline
5
Data collection
Model assessment
Model selection
Data collection
The first step is collecting data related to the classification task.
Domain knowledge is needed.
What if assigning the class label is too time-consuming or even impossible?
6
Data collection
7
[Flowchart: data collection steps — identification of features; data labeling (if no class label is available); unsupervised/supervised discretization (if discretization is needed); normalization/standardization (if needed); removing irrelevant features]
Features
Different types of features [more]
New features can be generated from simple stats
Some classifiers require categorical features => discretization
8
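To make "new features from simple stats" concrete, here is a minimal pandas sketch on a made-up toy dataframe (the column names are invented for illustration):

```python
import pandas as pd

# Toy dataframe (made-up columns) standing in for collected raw features.
df = pd.DataFrame({
    "daily_calories": [1800, 2500, 2200],
    "daily_protein_g": [60, 120, 80],
    "visits_last_week": [3, 0, 7],
})

# New features derived from simple statistics of existing ones.
df["protein_per_calorie"] = df["daily_protein_g"] / df["daily_calories"]
df["calories_zscore"] = (
    (df["daily_calories"] - df["daily_calories"].mean()) / df["daily_calories"].std()
)
print(df)
```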
ML before 2012*
9
Cleverly designed features
Input data
ML model
Much of the “heavy lifting” happens here.
Final performance is only as good as the feature set.
A typical ML approach after 2012
10
Features
Input data
Model
Deep learning
Features and model are learned together, mutually reinforcing each other.
Data collection
11
Labels
Collecting a lot of data (features) is often easy.
Labeling data is time-consuming, difficult, and sometimes even impossible.
12
Label: “Is the page credible?”
A human dietary expert is needed.
Potential labelers
13
14
[Diagram: crowdsourcing workflow — the requester (1) submits the task “Is this webpage credible?” to a crowdsourcing platform, crowd workers (2) accept the task and (3) return their answers (e.g., “Credible”, “Not credible”), and the requester (4) collects the answers]
Data collection
15
Discretization
Why?
16
Discretization
Unsupervised
17
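A minimal sketch of the two usual unsupervised options, equal-width and equal-frequency binning, using pandas (the toy ages series is made up):

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67, 71, 84])

# Equal-width binning: split the value range into 4 intervals of equal width.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each of the 4 bins gets (roughly) the same number of points.
equal_freq = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```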
Discretization
Supervised
18
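Supervised discretization uses the class labels to place the cut points. One common choice (an assumption here, not necessarily the method shown in the lecture) is to let a shallow decision tree pick the thresholds:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one continuous feature and a binary label (made up for illustration).
x = np.array([[18], [22], [25], [31], [38], [45], [52], [67], [71], [84]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

# A shallow tree chooses thresholds that best separate the classes.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x, y)

# The learned thresholds become the bin boundaries for supervised discretization.
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
bins = np.digitize(x.ravel(), thresholds)
print("cut points:", thresholds)
print("discretized feature:", bins)
```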
Data collection
19
Removing irrelevant features: Feature selection
20
Offline feature selection
Rank features according to their individual predictive power; then select the best ones
Pros:
Cons:
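A minimal sketch of offline selection with scikit-learn, assuming the built-in breast-cancer dataset and an ANOVA F-test as the per-feature score (the slide does not prescribe a particular score):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature individually (ANOVA F-test), then keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
```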
21
Ranking of features
Continuous features (and ideally labels):
Categorical features and labels:
22
Ranking of features
Categorical features and labels (cont’d):
23
Ranking of features
Beware: collectively relevant features may look individually irrelevant!
24
Ranking of features
Beware: collectively relevant features may look individually irrelevant!
25
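A tiny synthetic illustration of this pitfall (an XOR-style label, chosen here for illustration): each feature alone has essentially zero mutual information with the label, yet the two together determine it exactly.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 1000)
x2 = rng.integers(0, 2, 1000)
y = x1 ^ x2                      # label depends on both features jointly
X = np.column_stack([x1, x2])

# Individually, each feature carries (almost) no information about y ...
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))

# ... yet together they predict y perfectly.
print(DecisionTreeClassifier().fit(X, y).score(X, y))
```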
Online feature selection
Forward feature selection: greedily add features; evaluate on a validation dataset; stop when there is no improvement
Pros
Cons
26
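A minimal sketch of greedy forward selection with a held-out validation set, assuming the breast-cancer dataset and logistic regression as the classifier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def val_score(features):
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_tr[:, features], y_tr)
    return clf.score(X_val[:, features], y_val)

selected, best = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {f: val_score(selected + [f]) for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best:           # stop when no improvement on the validation set
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best = s_best

print("selected features:", selected, "validation accuracy:", round(best, 3))
```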
Online feature selection
Backward feature selection: greedily remove features; evaluate on validation dataset; stop when no improvement
Pros
Cons
27
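scikit-learn's SequentialFeatureSelector implements the same greedy idea; with direction="backward" it starts from all features and removes them one by one. Note that it scores candidates by cross-validation and stops at a preset number of features rather than at "no improvement", so it is a close but not exact match to the recipe above (dataset and classifier are again illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Start from all features and greedily drop the least useful one at each step,
# scoring candidate subsets with 5-fold cross-validation (can be slow).
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=10,
    direction="backward",
    cv=5,
)
selector.fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```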
Data collection
28
Feature normalization
29
Logarithmic scaling
xi’ = log(xi)
30
Min-max scaling
xi’ = (xi – mi)/(Mi – mi)
where Mi and mi are the max and min values of feature xi respectively
The new feature xi’ lies in the interval [0,1]
31
Standardization
xi’ = (xi – μi)/σi
where μi is the mean value of feature xi, and σi is the standard deviation
The new feature xi’ has mean 0 and standard deviation 1
32
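The three transformations on one toy feature (values made up), side by side:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])

x_log = np.log(x)                           # logarithmic scaling: x' = log(x)
x_minmax = MinMaxScaler().fit_transform(x)  # min-max: (x - min) / (max - min), in [0, 1]
x_std = StandardScaler().fit_transform(x)   # standardization: (x - mean) / std

print(np.hstack([x, x_log, x_minmax, x_std]))
```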
Dangers of standardization and scaling
Standardization:
Min-max scaling:
33
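The bullets above are filled in during the lecture; one danger that is safe to state here (as an assumption, not a transcript of the slide) is data leakage: if the scaler is fit on the full dataset, statistics from the test set leak into training. The safe pattern fits on the training set only. Min-max scaling is also very sensitive to a single outlier, since the observed min and max determine the stretch. A sketch of the safe pattern:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(200, 3))
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)   # fit on the training set only ...
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # ... and reuse those statistics on the test set
```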
34
Commercial break
Classification pipeline
35
Data collection
Model assessment
Model selection
Model selection: high level
Need to choose type of model
36
Model selection: low level
Usually a classifier has some “hyperparameters” to be tuned
37
Loss function (more of them later!)
Categorical output
Real-valued output
38
Model selection: on what data to evaluate?
What if you can’t afford a 3-way split because you have too little data?
→ Cross-validation! (see next slide)
39
[Diagram: full dataset split into a training set, a validation set, and a test set]
Cross-validation
40
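A minimal cross-validation sketch with scikit-learn (dataset and classifier are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: train on 4/5 of the data, validate on the held-out 1/5,
# and rotate so every point is used for validation exactly once.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores, scores.mean())
```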
Model selection
41
[Plot: validation error as a function of the hyperparameter value]
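A sketch of the curve behind this plot: sweep one hyperparameter, record the cross-validated error for each value, and pick the minimum (the regularization strength C of a logistic regression is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Sweep a hyperparameter and compute the validation (cross-validation) error.
for C in [0.001, 0.01, 0.1, 1, 10]:
    err = 1 - cross_val_score(LogisticRegression(C=C, max_iter=5000), X, y, cv=5).mean()
    print(f"C={C:<6} validation error={err:.3f}")
```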
Performance metrics for binary classification
For categorical binary classification, the usual metrics are based on the confusion matrix, which has 4 values:
42
|                | Class: Pos | Class: Neg |
|----------------|------------|------------|
| Classified Pos | TP         | FP         |
| Classified Neg | FN         | TN         |
Accuracy
Appropriate metric when
43
Accuracy (skewed example)
Classifier 1: Accuracy = 85/100 = 85%
Always ¬Fraud: Accuracy = 90/100 = 90%
44
Classifier 1:

|                   | Class: Fraud | Class: ¬Fraud |
|-------------------|--------------|---------------|
| Classified Fraud  | 5            | 10            |
| Classified ¬Fraud | 5            | 80            |

Always ¬Fraud:

|                   | Class: Fraud | Class: ¬Fraud |
|-------------------|--------------|---------------|
| Classified Fraud  | 0            | 0             |
| Classified ¬Fraud | 10           | 90            |
Poll time
Which classifier is better?
45
Classifier 1 (100 data points):

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 45            | 20             |
| Classified ¬Cancer | 5             | 30             |

Classifier 2 (100 data points):

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 40            | 10             |
| Classified ¬Cancer | 10            | 40             |
Poll time
Which classifier is better?
46
Classifier 1:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 45            | 20             |
| Classified ¬Cancer | 5             | 30             |

Classifier 2:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 40            | 10             |
| Classified ¬Cancer | 10            | 40             |
Precision and recall
Precision: what fraction of positive predictions are actually positive?
Recall: what fraction of actually positive examples did I recognize as such?
47
[source]
Precision and recall
P1 = 45/65 ≈ 0.69   P2 = 40/50 = 0.8
R1 = 45/50 = 0.9    R2 = 40/50 = 0.8
48
Classifier 1:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 45            | 20             |
| Classified ¬Cancer | 5             | 30             |

Classifier 2:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 40            | 10             |
| Classified ¬Cancer | 10            | 40             |

Everybody has cancer:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 50            | 50             |
| Classified ¬Cancer | 0             | 0              |
P = 50/100 = 0.5
R = 50/50 = 1
F-score
Sometimes it’s necessary to have a single metric to compare classifiers
F-score (or F1-score): harmonic mean of precision and recall
Precision and recall can be differently weighted, if one is more important than the other
49
F1 = 1 / (0.5 * (1/P + 1/R)) = 2PR / (P + R)
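Worked out for Classifier 1 from the cancer example (TP = 45, FP = 20, FN = 5, TN = 30):

```python
# Confusion-matrix counts for Classifier 1 from the slide.
TP, FP, FN, TN = 45, 20, 5, 30

precision = TP / (TP + FP)                           # 45/65 ≈ 0.69
recall = TP / (TP + FN)                              # 45/50 = 0.90
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.78
accuracy = (TP + TN) / (TP + FP + FN + TN)           # 75/100 = 0.75
print(precision, recall, f1, accuracy)
```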
Precision and recall
F1 = 2*(0.69*0.9)/(0.69+0.9) ≈ 0.78
F2 = 2*(0.8*0.8)/(0.8+0.8) = 0.8
50
Classifier 1:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 45            | 20             |
| Classified ¬Cancer | 5             | 30             |

Classifier 2:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 40            | 10             |
| Classified ¬Cancer | 10            | 40             |

Everybody has cancer:

|                    | Class: Cancer | Class: ¬Cancer |
|--------------------|---------------|----------------|
| Classified Cancer  | 50            | 50             |
| Classified ¬Cancer | 0             | 0              |
F = 2*(0.5*1)/(0.5+1) ≈ 0.67
Precision/recall curve
51
[Plot: precision/recall curve, traced out by decreasing the classification threshold]
ROC curve
ROC = Receiver-Operating Characteristic (WTF?!)
Y-axis: true-positive rate = TP/(TP + FN), a.k.a. recall
X-axis: false-positive rate = FP/(FP + TN)
52
[Plot: ROC curve (true-positive rate vs. false-positive rate), traced out by decreasing the classification threshold]
ROC AUC
ROC AUC is the “area under the curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).
53
[Plot: ROC curve (true-positive rate vs. false-positive rate); the diagonal corresponds to a random ordering and has area 0.5]
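A minimal sketch of computing the ROC curve and its AUC with scikit-learn, assuming a classifier that outputs scores/probabilities rather than hard labels (dataset and model are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # per-example scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("ROC AUC:", roc_auc_score(y_te, scores))
```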
Bias and variance
54
55
[Plot: validation error vs. model complexity]
How to know where on the x-axis you are (without fiddling with model complexity)?
56
[Plot: validation error vs. model complexity]
Bias and variance
Bias and variance can be assessed by comparing the error metric on the training set and the validation set => always plot learning curves (training set size vs. training/validation errors)
57
[Plot: learning curves — training and validation error vs. training set size]
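A sketch of producing these learning curves with scikit-learn's learning_curve (dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Training / validation accuracy as a function of the training-set size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)
print("training set sizes:", sizes)
print("training error:   ", 1 - train_scores.mean(axis=1))
print("validation error: ", 1 - val_scores.mean(axis=1))
```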
When more data helps
58
High bias: more data doesn’t help.
High variance: more data helps.
[Plots: training/validation error curves — one panel with fixed data set size and varying model complexity, one with fixed model complexity and varying data set size]
For curious ADAventurers: “Reconciling modern machine-learning practice and the classical bias–variance trade-off”
Classification pipeline
59
Data collection
Model assessment
Model selection
Model assessment
60
Useful reads
61
62
Feedback
63
Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec8-feedback
Crowdsourcing
Different types of workers
64
[Plot: worker types positioned by true-positive rate vs. true-negative rate — expert, normal worker, random spammer, uniform spammer, malicious spammer]
Catching malicious spammers
65
Crowdsourcing
Answer aggregation problem
66
Crowd (workers)
| Worker | Webpage      | Credible? |
|--------|--------------|-----------|
| W1     | www.diet.com | C         |
| W2     | www.diet.com | ¬C        |
| W3     | www.diet.com | C         |
| ...    | ...          | ...       |

Aggregation: www.diet.com → C
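The simplest aggregation rule is a per-item majority vote, sketched below with pandas; real systems typically weight workers by estimated reliability instead (e.g., to down-weight spammers).

```python
import pandas as pd

# Worker answers from the slide's example.
answers = pd.DataFrame({
    "worker":  ["W1", "W2", "W3"],
    "webpage": ["www.diet.com"] * 3,
    "label":   ["C", "¬C", "C"],
})

# Simplest aggregation: majority vote per webpage.
aggregated = answers.groupby("webpage")["label"].agg(lambda s: s.mode().iloc[0])
print(aggregated)   # www.diet.com -> C
```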
Recap
67
Data collection
Model assessment
Model selection
|              | Class: A | Class: B |
|--------------|----------|----------|
| Classified A | TP       | FP       |
| Classified B | FN       | TN       |
Model selection
Need evaluation metric!
68
[Flowchart: split the dataset into “training” and “validation” → set classifier parameters → train the classifier on the training set → evaluate the classifier on the validation set → performance acceptable? If not, adjust the parameters and repeat]
Forward-selected features vs. performance
69
Training and testing with heaps of data
70
[Diagram: database D is split into a training set (60% of D) and a test set (40% of D); the model is learned on the training set and evaluated on the test set, yielding the performance metric]
Data-efficient training and testing:
Leave-one-out cross-validation
71
[Diagram: repeated N times — database D is split into a training set ((N−1)/N of D) and a test set (1/N of D); the model is learned on the training set and evaluated on the test set; the performance metric is the average over the N runs]
Data-efficient training and testing:
k-fold cross validation
72
[Diagram: repeated k times — database D is split into a training set ((k−1)/k of D) and a test set (1/k of D); the model is learned on the training set and evaluated on the test set; the performance metric is the average over the k runs]
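The two diagrams above, as code (dataset and classifier are illustrative; leave-one-out trains N models, so it is much slower):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# k-fold: k models, each trained on (k-1)/k of the data and tested on the remaining 1/k.
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean accuracy:", kfold_scores.mean())

# Leave-one-out: N models, each tested on a single held-out example (slow for large N).
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOO mean accuracy:", loo_scores.mean())
```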
More data often beats better algorithms
73