1 of 61

1

Michele Catasta (@pirroh)

07 - Applied ML

2 of 61

Announcements/feedback

HW02 grades communicated this week

HW04 delayed one week

Next lab session: hands-on ML

I’ve read your project proposals -- great job :-)

2

3 of 61

Classification pipeline

3

Data collection

Model assessment

Model selection

4 of 61

Data collection

The first step is collecting data related to the classification task.

    • Definition of the attributes (or features) that describe a data item and the class label.

Domain knowledge is needed.

What if assigning the class label is time-consuming or impossible?

4

5 of 61

Data collection

5

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

6 of 61

Features

Different types of features

    • Numerical (e.g., age, temperature ...)
    • Ordinal (e.g., phone code ...)
    • Categorical (e.g., student, weather ...)

New features can be generated from simple stats

    • Feature engineering is something of an art, so the best advice is to look at what other people have done for similar problems

Some classifiers require categorical features => Discretisation

6

7 of 61

A Brief History of Machine Learning

  • Before 2012*, but still very common today:

* Before publication of Krizhevsky et al.’s ImageNet CNN paper.

7

[Diagram: Input Data → Cleverly-Designed Features → ML model]

Most of the “heavy lifting” happens in the feature design: the final performance is only as good as the feature set.

8 of 61

A Brief History of Machine Learning

  • After 2012:

8

[Diagram: Input Data → Features + model (Deep Learning)]

Features and model are learned together, mutually reinforcing.

9 of 61

Data collection

9

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

10 of 61

Labels

Collecting a lot of data is easy

Labelling data is time-consuming, difficult, and sometimes even impossible

10

Example: labelling webpages about diets requires an expert in diets

11 of 61

11

[Diagram — crowdsourcing workflow: the requester (1) submits a task (e.g., “Is this webpage credible?”) to the crowdsourcing platform; workers in the crowd (2) accept the task and (3) return their answers (e.g., C, C, ¬C, C, ¬C); the requester (4) collects the answers]

12 of 61

Crowdsourcing

Different types of workers

    • Truthful
      • Expert
      • Normal
    • Untruthful
      • Sloppy
      • Uniform spammer
      • Random spammer

12

[Plot: worker types (expert, normal worker, sloppy worker, random spammer, and uniform spammers at the two extremes) positioned by their true positive rate and true negative rate]

13 of 61

Crowdsourcing

Answer aggregation problem

13

Crowd (workers):

Worker   Webpage         Credible
W1       www.diet.com    C
W2       www.diet.com    ¬C
W3       www.diet.com    C
...      ...             ...

Aggregation:

Webpage         Credible
www.diet.com    C
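
A minimal sketch of the simplest aggregation strategy, majority voting, applied to the worker answers above (the data and the C/¬C labels are from the slide; the function name is my own):

```python
from collections import Counter

# Worker answers for the same item, as on the slide: (worker, webpage, label)
answers = [
    ("W1", "www.diet.com", "C"),
    ("W2", "www.diet.com", "¬C"),
    ("W3", "www.diet.com", "C"),
]

def majority_vote(answers):
    """Aggregate crowd answers per item by taking the most frequent label."""
    votes = {}
    for _, item, label in answers:
        votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

print(majority_vote(answers))  # {'www.diet.com': 'C'}
```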

14 of 61

Data collection

14

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

15 of 61

Discretisation

Unsupervised

  • Equal width
    • Divide the range into a predefined number of bins
  • Equal frequency
    • Divide the range into a predefined number of bins so that every interval contains the same number of values
  • Clustering
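
As a rough illustration (with toy data of my own), equal-width and equal-frequency binning can be done with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 90, size=1000))   # toy numerical feature

# Equal width: split the range [min, max] into 4 bins of equal width
equal_width = pd.cut(age, bins=4, labels=False)

# Equal frequency: 4 bins, each containing roughly the same number of values
equal_freq = pd.qcut(age, q=4, labels=False)

print(pd.Series(equal_width).value_counts().sort_index())  # equal widths, unequal counts
print(pd.Series(equal_freq).value_counts().sort_index())   # roughly equal counts
```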

15

16 of 61

Discretisation

Supervised

  • Test the hypothesis that two adjacent intervals of a feature are independent of the class
  • If they are independent, they should be merged
  • Otherwise they should remain separate
  • Independence test: the χ² statistic
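
A sketch of the core test only (not a full merging algorithm; the counts are made up): for two adjacent intervals, build a contingency table of interval × class counts and test independence with χ².

```python
from scipy.stats import chi2_contingency

# Counts of (positive, negative) examples falling in two adjacent intervals (toy numbers)
interval_a = [20, 5]
interval_b = [18, 6]

chi2, p_value, dof, expected = chi2_contingency([interval_a, interval_b])

# A high p-value means the class distribution is statistically independent of the
# interval boundary, so the two intervals can be merged.
print(f"chi2={chi2:.3f}, p={p_value:.3f}, merge={p_value > 0.05}")
```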

16

17 of 61

Data collection

17

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

18 of 61

Feature selection

Reducing the N features to a subset containing the best M < N

There are 2^N possible subsets

Solutions

    • Filtering
    • Wrapper

18

19 of 61

Feature selection

Filtering: rank features according to their predictive power and select the best ones

    • Pros
      • Independent of the classifier (performed only once)
    • Cons
      • Independent of the classifier (ignore interaction with the classifier)
      • Assume features are independent

19

20 of 61

Ranking of features

Numerical

    • Correlation coefficient (Pearson; captures only linear relationships!)

    • Mutual information
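
A sketch of filter-style ranking for numerical features, assuming scikit-learn is available (the dataset here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

# Pearson correlation of each feature with the class (captures linear relations only)
pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Mutual information (also captures non-linear dependencies)
mi = mutual_info_classif(X, y, random_state=0)

ranking = np.argsort(-mi)          # feature indices, best first
print("|Pearson r|:", pearson.round(2))
print("Mutual info:", mi.round(2))
print("MI ranking :", ranking)
```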

20

21 of 61

Ranking of features

Categorical

    • χ² method: unlike correlation, the chi-square test only checks whether the class and the feature are independent, without indicating the strength or direction of any existing relationship.
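
A sketch using scikit-learn’s χ² scorer on one-hot encoded categorical features (toy data of my own; chi2 requires non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder

# Toy categorical features [weather, student] and a binary class label
X_cat = np.array([["sunny", "yes"], ["rainy", "no"], ["sunny", "no"],
                  ["overcast", "yes"], ["rainy", "yes"], ["sunny", "yes"]])
y = np.array([1, 0, 0, 1, 0, 1])

X_onehot = OneHotEncoder().fit_transform(X_cat)   # sparse one-hot matrix

# chi2 tests independence between each (one-hot) feature column and the class
chi2_scores, p_values = chi2(X_onehot, y)
print(chi2_scores.round(2), p_values.round(2))
```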

21

22 of 61

Ranking of features

Beware of trusting correlations blindly

22

23 of 61

23

24 of 61

Ranking of features

Collectively relevant features may look individually irrelevant!

24

25 of 61

Feature selection

Wrapper: iteratively add features, using cross-validation to guide feature inclusion, and stop when there is no further improvement (see the sketch after this list)

    • Pros
      • Interact with the classifier
      • No independence assumption
    • Cons
      • Computationally intensive
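
A minimal sketch of a forward-selection wrapper (my own setup; scikit-learn also ships a SequentialFeatureSelector that implements the same idea):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
clf = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:      # stop when there is no improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("Selected features:", selected, "CV accuracy:", round(best_score, 3))
```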

25

26 of 61

Feature selection

Ablation: iteratively remove features, using cross-validation to guide feature removal, and stop when there is no further improvement

    • Pros
      • Interact with the classifier
      • No independence assumption
    • Cons
      • Computationally intensive

26

27 of 61

Example of Feature Count vs. Accuracy

27

28 of 61

Data collection

28

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

29 of 61

Feature normalisation

Some classifiers do not handle features with very different scales well

    • Revenue: 10,000,000 (CHF)
    • # of employees: 300

Features with large values dominate the others, and the classifier tends to over-optimise for them

29

30 of 61

Standardisation

x_i' = (x_i − μ_i) / σ_i

where μ_i is the mean value of feature x_i, and σ_i is its standard deviation

The new feature x_i' has mean 0 and standard deviation 1
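
A quick numpy sketch of the formula above, on made-up revenue values (scikit-learn’s StandardScaler does the same):

```python
import numpy as np

x = np.array([10_000_000., 12_000_000., 9_500_000., 11_200_000.])  # e.g., revenue (CHF)

x_std = (x - x.mean()) / x.std()            # x_i' = (x_i - mu_i) / sigma_i
print(x_std.mean().round(6), x_std.std())   # ~0.0 and 1.0
```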

30

31 of 61

Scaling

x_i' = (x_i − m_i) / (M_i − m_i)

where M_i and m_i are the maximum and minimum values of feature x_i, respectively

The new feature x_i' lies in the interval [0, 1]
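
And the corresponding min-max scaling sketch, again on made-up values (MinMaxScaler in scikit-learn):

```python
import numpy as np

x = np.array([300., 20., 150., 75.])             # e.g., number of employees

x_scaled = (x - x.min()) / (x.max() - x.min())   # x_i' lies in [0, 1]
print(x_scaled)                                  # min maps to 0, max to 1
```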

31

32 of 61

Standardisation vs Scaling

Standardisation:

    • Assumes the data is approximately Gaussian-distributed

Scaling:

    • If the data has outliers, they squeeze the “normal” values into a very small interval

32

33 of 61

Classification pipeline

33

Data collection

Model assessment

Model selection

34 of 61

34

35 of 61

Model selection

Usually a classifier has some hyperparameters that need to be tuned

    • Regularisation factor
    • Threshold
    • Distance function
    • Number of neighbours

A performance metric is needed

35

36 of 61

Model selection

36

[Flowchart — model selection loop: split the dataset into “training” and “validation” sets → set the classifier parameters → train the classifier on the training set → evaluate the classifier on the validation set → performance acceptable? (N: adjust the parameters and repeat; Y: done)]
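
A sketch of this loop for a single hyperparameter, the number of neighbours of a k-NN classifier (one of the examples on the previous slide); the dataset and split size are my own choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:                 # set classifier parameters
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)                 # train on the training set
    acc = clf.score(X_val, y_val)             # evaluate on the validation set
    if acc > best_acc:                        # keep the best-performing setting
        best_k, best_acc = k, acc

print(f"best k = {best_k}, validation accuracy = {best_acc:.3f}")
```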

37 of 61

Loss function

Categorical output

    • 0-1 loss function: L(y, ŷ) = 1 if ŷ ≠ y, and 0 otherwise

Real-valued output

    • Squared error: L(y, ŷ) = (y − ŷ)²

    • Absolute error: L(y, ŷ) = |y − ŷ|
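
A short numpy illustration of these three losses on toy predictions of my own:

```python
import numpy as np

# Categorical output: 0-1 loss
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
zero_one = np.mean(y_true != y_pred)          # fraction of misclassified examples

# Real-valued output: squared and absolute error
t = np.array([3.0, -0.5, 2.0])
p = np.array([2.5,  0.0, 2.0])
squared  = np.mean((t - p) ** 2)
absolute = np.mean(np.abs(t - p))

print(zero_one, squared, absolute)            # 0.5, ~0.167, ~0.333
```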

37

38 of 61

Model selection

38

[Plot: loss function (y-axis) versus parameter value (x-axis)]

39 of 61

Performance metric for Binary Classification

For categorical binary classification, the usual metrics are built from four possible outcomes

    • True Positive (positive examples classified as positive)
    • True Negative (negative examples classified as negative)
    • False Positive (negative examples classified as positive)
    • False Negative (positive examples classified as negative)

39

                     Class A    Class B
Classified as A         TP         FP
Classified as B         FN         TN

40 of 61

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN) is an appropriate metric when

    • Classes are not skewed
    • Errors have the same importance

40

41 of 61

Accuracy (skewed example)

Classifier 1 (A = 85/100 = 0.85)

                        Class Fraud    Class ¬Fraud
Classified Fraud             5              10
Classified ¬Fraud            5              80

Always ¬Fraud (A = 90/100 = 0.90)

                        Class Fraud    Class ¬Fraud
Classified Fraud             0               0
Classified ¬Fraud           10              90

42 of 61

Question time

Which is the “best” classifier?

            • Classifier 1
            • Classifier 2
            • Both are equally good

42

Classifier 1

                   Class A    Class B
Classified as A       45         20
Classified as B        5         30

Classifier 2

                   Class A    Class B
Classified as A       40         10
Classified as B       10         40

43 of 61

Question time

Which is the “best” classifier?

            • Classifier 1
            • Classifier 2
            • Both are equally good

43

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

44 of 61

Precision and recall

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

44

45 of 61

Precision and recall

P1 = 45/65 = 0.69    P2 = 40/50 = 0.80

R1 = 45/50 = 0.90    R2 = 40/50 = 0.80

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

Everybody has cancer

                        Class Cancer    Class ¬Cancer
Classified Cancer            50              50
Classified ¬Cancer            0               0

P = 50/100 = 0.5    R = 50/50 = 1
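
A small check of the numbers above, computing precision and recall directly from the three confusion matrices on this slide:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# (TP, FP, FN) read off the confusion matrices above
classifiers = {
    "Classifier 1":         (45, 20, 5),
    "Classifier 2":         (40, 10, 10),
    "Everybody has cancer": (50, 50, 0),
}

for name, (tp, fp, fn) in classifiers.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name}: P = {p:.2f}, R = {r:.2f}")
# Classifier 1: P = 0.69, R = 0.90
# Classifier 2: P = 0.80, R = 0.80
# Everybody has cancer: P = 0.50, R = 1.00
```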

46 of 61

F-score

Sometimes it is necessary to have a single metric to compare classifiers

F-score (or F1 score): F1 = 2 · P · R / (P + R), the harmonic mean of precision and recall

Precision and recall can be weighted differently, if one is more important than the other

46

47 of 61

Precision and recall

F1 = 2 · (0.69 · 0.9) / (0.69 + 0.9) = 0.78    F2 = 2 · (0.8 · 0.8) / (0.8 + 0.8) = 0.80

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

Everybody has cancer

                        Class Cancer    Class ¬Cancer
Classified Cancer            50              50
Classified ¬Cancer            0               0

F = 2 · (0.5 · 1) / (0.5 + 1) = 0.66

48 of 61

ROC plots

ROC stands for Receiver Operating Characteristic. A ROC plot has:

Y-axis: true positive rate = tp/(tp + fn), same as recall

X-axis: false positive rate = fp/(fp + tn) = 1 - specificity

48

[Plot: example ROC curve, annotated with the direction of increasing score]

49 of 61

ROC AUC

ROC AUC is the “Area Under the Curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).
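
A sketch with scikit-learn, using a classifier’s predicted scores on a synthetic dataset (the dataset and classifier are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # points of the ROC curve
auc = roc_auc_score(y_te, scores)                # area under that curve
print(f"ROC AUC = {auc:.3f}")                    # 0.5 = random, 1.0 = perfect
```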

49

[Plot: ROC curve of a random ordering — the diagonal, with area = 0.5]

50 of 61

Training and test set

50

[Diagram: database D is split into a training set (60% of D) and a test set (40% of D); the model is learned on the training set and evaluated on the test set, producing the performance metric]
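
The same 60/40 split with scikit-learn (the dataset and classifier below are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 60% of D for training, 40% of D for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learn model
print("test accuracy:", round(model.score(X_test, y_test), 3))        # evaluate model
```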

51 of 61

Leave-one-out cross-validation

51

[Diagram: database D is split into a training set ((N−1)/N of D) and a test set (1/N of D); the model is learned and evaluated; this is repeated N times and the performance metric is averaged over the N runs]
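
Leave-one-out with scikit-learn (same assumed dataset and classifier as in the previous sketch):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# N folds: each run trains on N-1 examples and tests on the single held-out one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print("LOO accuracy (average of N runs):", round(scores.mean(), 3))
```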

52 of 61

K-fold cross validation

52

[Diagram: database D is split into a training set ((k−1)/k of D) and a test set (1/k of D); the model is learned and evaluated; this is repeated k times and the performance metric is averaged over the k runs]
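
And k-fold cross-validation, with k = 5 as an assumed choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("accuracy per fold:", scores.round(3))
print("average of k runs:", round(scores.mean(), 3))
```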

53 of 61

Bias and variance

Bias and variance can be assessed by comparing the error metric on the training set and the test set => always plot learning curves

In the ideal case, we want low bias (small training error) and low variance (test error close to the training error)
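
A sketch of how the learning curves mentioned above can be computed with scikit-learn (plotting left out; the dataset and classifier are assumed):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Training and validation scores for increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

train_err = 1 - train_scores.mean(axis=1)   # high bias if this stays large
val_err   = 1 - val_scores.mean(axis=1)     # high variance if the gap to train_err is large
for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n={n:4d}  train error={tr:.3f}  validation error={va:.3f}")
```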

53

54 of 61

Bias and variance

54

55 of 61

When more data helps

55

[Plots: learning curves for the high-bias case and the high-variance case — adding more data helps mainly in the high-variance case]

56 of 61

When more data helps

56

Data > Algorithms

(take it with a grain of salt…:-)

57 of 61

Classification pipeline

©2016, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis

57

Data collection

Model assessment

Model selection

58 of 61

Model assessment

Model assessment is the task of estimating the classification accuracy of a fixed model (i.e., the best model found during model selection)

Evaluate the model on the whole dataset using:

    • Training and test
    • Leave-one-out
    • K-fold cross-validation

58

59 of 61

Important Reads

59

60 of 61

60

Check these slides!

61 of 61

Sources

  • Aberer / Catasta - DIS 2016
  • CS194 Berkeley University
  • Tristan Penman - PWL @ Seattle

61