1
Michele @pirroh Catasta
07 - Applied ML
Announcements/feedback
HW02 grades communicated this week
HW04 delayed one week
Next lab session: hands on ML
I’ve read your project proposals -- great job :-)
2
Classification pipeline
3
Data collection
Model assessment
Model selection
Data collection
The first step is collecting data related to the classification task.
Domain knowledge is needed.
What if assigning the class label is time consuming/impossible?
4
Data collection
5
(Flowchart) Class label available? If not, crowdsource the labelling. Then: identification of features → removing irrelevant features → discretisation needed? (if yes: unsupervised/supervised discretisation) → normalisation needed? (if yes: standardisation/scaling).
Features
Different types of features
New features can be generated from simple stats
Some classifiers require categorical features => Discretisation
6
A Brief History of Machine Learning
* Before publication of Krizhevsky et al.’s ImageNet CNN paper.
7
(Diagram) Input Data → Cleverly-Designed Features → ML model. Most of the “heavy lifting” happens in the feature design: the final performance is only as good as the feature set.
A Brief History of Machine Learning
8
(Diagram) Deep Learning: Input Data → Features → Model, with the features and the model learned together, mutually reinforcing.
Data collection
9
Labels
Collecting a lot of data is easy
Labelling data is time consuming, difficult and sometimes even impossible
10
An expert in diets is needed.
11
(Diagram) A requester submits a task (e.g., “Is this webpage credible?”) to a crowdsourcing platform; workers in the crowd accept the task and return their answers (e.g., C, C, ¬C, C, ¬C); the requester then collects the answers.
Crowdsourcing
Different types of workers
12
(Plot of worker types by true positive rate vs. true negative rate: experts, normal workers, sloppy workers, random spammers, uniform spammers.)
Crowdsourcing
Answer aggregation problem
13
Crowd (workers) answers:

Worker | Webpage      | Credible
W1     | www.diet.com | C
W2     | www.diet.com | ¬C
W3     | www.diet.com | C
...    | ...          | ...

Aggregation: www.diet.com → C
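The slides only state the aggregation problem; as an illustration, here is a minimal sketch of the simplest aggregation rule, majority voting over the collected answers (the data and the helper name are invented for the example):

```python
from collections import Counter

# (worker, item, answer) triples collected from the crowd (made-up data)
answers = [
    ("W1", "www.diet.com", "C"),
    ("W2", "www.diet.com", "¬C"),
    ("W3", "www.diet.com", "C"),
]

def majority_vote(answers):
    """Aggregate per item by keeping the most frequent answer."""
    votes = {}
    for _, item, label in answers:
        votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

print(majority_vote(answers))  # {'www.diet.com': 'C'}
```

More robust schemes weight each worker by an estimate of their reliability, which helps against the spammer profiles shown earlier.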
Data collection
14
Discretisation
Unsupervised
15
Discretisation
Supervised
16
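As a concrete illustration (not from the slides), here is a small sketch of both flavours with scikit-learn; the feature values, bin counts and the shallow-tree trick for the supervised case are illustrative choices:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

ages = np.array([[23.0], [31.0], [35.0], [42.0], [58.0], [64.0]])

# Unsupervised discretisation: equal-width and equal-frequency bins
equal_width = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(equal_width.fit_transform(ages).ravel())
equal_freq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(equal_freq.fit_transform(ages).ravel())

# Supervised discretisation sketch: use a shallow decision tree's split
# thresholds (learned from invented class labels y) as bin boundaries
y = np.array([0, 0, 0, 1, 1, 1])
tree = DecisionTreeClassifier(max_depth=2).fit(ages, y)
print(sorted(t for t in tree.tree_.threshold if t != -2))  # -2 marks leaf nodes
```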
Data collection
17
Feature selection
Reducing the N features to the best subset of M < N features
There are 2^N possible subsets, so exhaustive search is infeasible
Solutions: filtering, wrapper, ablation
18
Feature selection
Filtering: rank features according to their predictive power and select the best ones
19
Ranking of features
Numerical
20
Ranking of features
Categorical
21
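As an illustrative sketch (the dataset and scoring functions are my choice, not the slides'), filter-style ranking can be done with scikit-learn: an ANOVA F-test for numerical features and a chi-squared test for categorical/count-valued features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, chi2

X, y = load_iris(return_X_y=True)

# Numerical features: rank by ANOVA F-score
num_ranker = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("F-scores:", np.round(num_ranker.scores_, 1))

# Categorical / count-valued features: chi2 (requires non-negative values)
cat_ranker = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores:", np.round(cat_ranker.scores_, 1))
```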
Ranking of features
Beware of blindly trusting correlations
22
23
Ranking of features
Collectively relevant features may look individually irrelevant!
24
Feature selection
Wrapper: iteratively add features, using cross-validation to guide feature inclusion, and stop when there is no further improvement
25
Feature selection
Ablation: iteratively remove features, using cross-validation to guide feature removal, and stop when there is no further improvement
26
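A rough sketch of the wrapper idea (greedy forward selection guided by cross-validation); the dataset, classifier and stopping rule are illustrative choices, and ablation would be the mirror image, starting from all features and removing one at a time:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # score every candidate feature added to the current subset
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # no improvement -> stop
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```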
Example of Feature Count vs. Accuracy
27
Data collection
28
Feature normalisation
Some classifiers do not handle features with very different scales well
Features with large values dominate the others, and the classifier tends to over-optimise for them
29
Standardisation
xi’ = (xi – μi)/σi
where μi is the mean value of feature xi, and σi is the standard deviation
The new feature xi’ has mean 0 and standard deviation 1
30
Scaling
xi’ = (xi – mi)/(Mi – mi)
where Mi and mi are the max and min values of feature xi respectively
The new feature xi’ lies in the interval [0,1]
31
Standardisation vs Scaling
(Comparison of when to prefer standardisation vs. scaling.)
32
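A minimal sketch of the two schemes with scikit-learn (the data matrix is invented; the second column deliberately has a much larger scale than the first):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)     # x' = (x - mean) / std
X_scaled = MinMaxScaler().fit_transform(X)    # x' = (x - min) / (max - min)

print(X_std.mean(axis=0), X_std.std(axis=0))        # ~[0 0] and [1 1]
print(X_scaled.min(axis=0), X_scaled.max(axis=0))   # [0 0] and [1 1]
```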
Classification pipeline
33
Data collection
Model assessment
Model selection
34
Model selection
Usually a classifier has some parameters to be tuned
A performance metric is needed
35
Model selection
36
(Flowchart) Split the dataset into “training” and “validation” sets → set the classifier parameters → train the classifier with the training set → evaluate the classifier with the validation set → performance acceptable? If not, go back, adjust the parameters and repeat.
Loss function
Categorical output
Real value output
37
Model selection
38
(Plot: loss function as a function of the parameter value.)
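A rough sketch of the model-selection loop above: try several parameter values, train on the training split, and keep the value with the best validation performance. The classifier, its parameter k and the candidate grid are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:                       # candidate parameter values
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)                   # validation performance
    if acc > best_acc:
        best_k, best_acc = k, acc

print("best k:", best_k, "validation accuracy:", round(best_acc, 3))
```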
Performance metric for Binary Classification
For binary classification, the usual metrics are built from four types of outcomes
39
                   | Class A | Class B
Classified as A    |   TP    |   FP
Classified as B    |   FN    |   TN
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN): an appropriate metric when the classes are balanced (not skewed)
40
Accuracy (skewed example)
41
Classifier 1        | Fraud | ¬Fraud
Classified Fraud    |   5   |   10
Classified ¬Fraud   |   5   |   80
Accuracy = 85/100 = 0.85

Always ¬Fraud       | Fraud | ¬Fraud
Classified Fraud    |   0   |    0
Classified ¬Fraud   |  10   |   90
Accuracy = 90/100 = 0.90
Question time
Which is the “best” classifier?
42
Classifier 1       | Class A | Class B
Classified as A    |   45    |   20
Classified as B    |    5    |   30

Classifier 2       | Class A | Class B
Classified as A    |   40    |   10
Classified as B    |   10    |   40
Question time
Which is the “best” classifier?
43
Classifier 1         | Cancer | ¬Cancer
Classified Cancer    |   45   |   20
Classified ¬Cancer   |    5   |   30

Classifier 2         | Cancer | ¬Cancer
Classified Cancer    |   40   |   10
Classified ¬Cancer   |   10   |   40
Precision and recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
44
Precision and recall
45
Classifier 1         | Cancer | ¬Cancer
Classified Cancer    |   45   |   20
Classified ¬Cancer   |    5   |   30
P1 = 45/65 = 0.69, R1 = 45/50 = 0.9

Classifier 2         | Cancer | ¬Cancer
Classified Cancer    |   40   |   10
Classified ¬Cancer   |   10   |   40
P2 = 40/50 = 0.8, R2 = 40/50 = 0.8

Everybody has cancer | Cancer | ¬Cancer
Classified Cancer    |   50   |   50
Classified ¬Cancer   |    0   |    0
P = 50/100 = 0.5, R = 50/50 = 1
F-score
Sometimes a single metric is needed to compare classifiers
F score (or F1 score): F1 = 2 · (Precision · Recall) / (Precision + Recall)
Precision and recall can be weighted differently (Fβ score) if one is more important than the other
46
Precision and recall
47
Classifier 1         | Cancer | ¬Cancer
Classified Cancer    |   45   |   20
Classified ¬Cancer   |    5   |   30
F1 = 2·(0.69·0.9)/(0.69+0.9) = 0.78

Classifier 2         | Cancer | ¬Cancer
Classified Cancer    |   40   |   10
Classified ¬Cancer   |   10   |   40
F2 = 2·(0.8·0.8)/(0.8+0.8) = 0.8

Everybody has cancer | Cancer | ¬Cancer
Classified Cancer    |   50   |   50
Classified ¬Cancer   |    0   |    0
F = 2·(0.5·1)/(0.5+1) = 0.66
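These computations can be reproduced with scikit-learn; a minimal sketch for Classifier 1, whose counts (TP=45, FP=20, FN=5, TN=30) are expanded into label vectors:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1] * 50 + [0] * 50)                  # 50 Cancer, 50 ¬Cancer
y_pred = np.concatenate([np.ones(45), np.zeros(5),      # TP=45, FN=5
                         np.ones(20), np.zeros(30)])    # FP=20, TN=30

print(round(precision_score(y_true, y_pred), 2))  # 0.69
print(round(recall_score(y_true, y_pred), 2))     # 0.9
print(round(f1_score(y_true, y_pred), 2))         # 0.78
```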
ROC plots
ROC stands for Receiver Operating Characteristic. ROC plots:
Y-axis: true positive rate = tp/(tp + fn), same as recall
X-axis: false positive rate = fp/(fp + tn) = 1 - specificity
48
(Plot: ROC curve obtained by sweeping the score threshold.)
ROC AUC
ROC AUC is the “Area Under the Curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).
49
(Plot: the diagonal corresponds to a random ordering, with area = 0.5.)
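A minimal sketch with scikit-learn; the classifier scores are synthetic (positives drawn to score higher than negatives on average):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 50 + [1] * 50)
# synthetic scores: positives tend to score higher than negatives
scores = np.concatenate([rng.normal(0.3, 0.2, 50), rng.normal(0.7, 0.2, 50)])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
print("AUC:", round(roc_auc_score(y_true, scores), 3))
```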
Training and test set
50
(Diagram) The database D is split into a training set (60% of D), used to learn the model, and a test set (40% of D), used to evaluate the model and produce the performance metric.
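A minimal sketch of this 60/40 holdout evaluation (classifier and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, test_size=0.4, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)          # learn model
print("test accuracy:", round(model.score(X_test, y_test), 3))  # evaluate model
```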
Leave-one-out cross-validation
51
(Diagram) The database D is split into a training set of (N−1)/N of D and a test set of 1/N of D (a single example); the model is learned on the training set and evaluated on the held-out example. This is repeated N times and the result is the average of the N runs.
K-fold cross validation
52
(Diagram) The database D is split into a training set of (k−1)/k of D and a test set of 1/k of D; the model is learned on the training set and evaluated on the held-out fold. This is repeated k times, each fold serving once as the test set, and the result is the average of the k runs.
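A minimal sketch of both schemes with scikit-learn; the averaged score plays the role of the “average of N/k runs” in the diagrams (classifier and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())            # N runs
kfold_scores = cross_val_score(
    clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))   # k runs

print("LOO accuracy:", round(loo_scores.mean(), 3))
print("5-fold accuracy:", round(kfold_scores.mean(), 3))
```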
Bias and variance
Bias and variance can be assessed by comparing the error metric on the training set and the test set => always plot learning curves
In the ideal case, we want low bias (small training error) and low variance (small test error)
53
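A rough sketch of plotting such learning curves with scikit-learn (the classifier and dataset are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

# plot training vs. validation error as the training set grows
plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
```

Both errors high and close together points to high bias; a large gap between a low training error and a high validation error points to high variance.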
Bias and variance
54
When more data helps
55
(Learning-curve plots: a high-bias case and a high-variance case.)
When more data helps
56
Data > Algorithms
(take it with a grain of salt…:-)
Classification pipeline
©2016, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis
57
Data collection
Model assessment
Model selection
Model assessment
Model assessment is the task of estimating the classification accuracy of a fixed model (i.e., the best model found during model selection)
Evaluate the model using the whole dataset
58
Important Reads
59
60
check these slides!
Sources
61