1 of 61

1

Michele Catasta (@pirroh)

07 - Applied ML

2 of 61

Announcements/feedback

HW02 grades communicated this week

HW04 delayed one week

Next lab session: hands-on ML

I’ve read your project proposals -- great job :-)

2

3 of 61

Classification pipeline

3

Data collection

Model assessment

Model selection

4 of 61

Data collection

The first step is collecting data related to the classification task.

    • Definition of the attributes (or features) that describe a data item and the class label.

Domain knowledge is needed.

What if assigning the class label is time-consuming or impossible?

4

5 of 61

Data collection

5

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

6 of 61

Features

Different types of features

    • Numerical (e.g., age, temperature ...)
    • Ordinal (e.g., phone code ...)
    • Categorical (e.g., student, weather ...)

New features can be generated from simple stats

    • Feature engineering is something of an art, so the best advice is to look at what other people have done for similar problems

Some classifiers require categorical features => Discretisation

6

7 of 61

A Brief History of Machine Learning

  • Before 2012*, but still very common today:

* Before publication of Krizhevsky et al.’s ImageNet CNN paper.

7

[Diagram: Input Data → Cleverly-Designed Features → ML model]

Most of the “heavy lifting” happens in the feature design: the final performance is only as good as the feature set.

8 of 61

A Brief History of Machine Learning

  • After 2012:

8

[Diagram: Input Data → Features + model (Deep Learning)]

Features and model are learned together, mutually reinforcing.

9 of 61

Data collection

9

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

10 of 61

Labels

Collecting a lot of data is easy

Labelling data is time-consuming, difficult, and sometimes even impossible

10

Example: labelling webpages about diets requires an expert in diets

11 of 61

11

[Diagram — crowdsourcing workflow: the requester (1) submits a task (e.g., “Is this webpage credible?”) to the crowdsourcing platform; workers in the crowd (2) accept the task and (3) return their answers (e.g., C, C, ¬C, C, ¬C); the requester (4) collects the answers]

12 of 61

Crowdsourcing

Different types of workers

    • Truthful
      • Expert
      • Normal
    • Untruthful
      • Sloppy
      • Uniform spammer
      • Random spammer

12

[Plot: worker types (expert, normal worker, sloppy worker, random spammer, and uniform spammers at the two extremes) positioned by their true positive rate and true negative rate]

13 of 61

Crowdsourcing

Answer aggregation problem

13

Crowd (workers):

Worker   Webpage         Credible
W1       www.diet.com    C
W2       www.diet.com    ¬C
W3       www.diet.com    C
...      ...             ...

Aggregation:

Webpage         Credible
www.diet.com    C
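
A minimal sketch of the simplest aggregation strategy, majority voting, applied to the worker answers above (the data and the C/¬C labels are from the slide; the function name is my own):

```python
from collections import Counter

# Worker answers for the same item, as on the slide: (worker, webpage, label)
answers = [
    ("W1", "www.diet.com", "C"),
    ("W2", "www.diet.com", "¬C"),
    ("W3", "www.diet.com", "C"),
]

def majority_vote(answers):
    """Aggregate crowd answers per item by taking the most frequent label."""
    votes = {}
    for _, item, label in answers:
        votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

print(majority_vote(answers))  # {'www.diet.com': 'C'}
```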

14 of 61

Data collection

14

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

15 of 61

Discretisation

Unsupervised

  • Equal width
    • Divide the range into a predefined number of bins
  • Equal frequency
    • Divide the range into a predefined number of bins so that every interval contains the same number of values
  • Clustering
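
As a rough illustration (with toy data of my own), equal-width and equal-frequency binning can be done with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 90, size=1000))   # toy numerical feature

# Equal width: split the range [min, max] into 4 bins of equal width
equal_width = pd.cut(age, bins=4, labels=False)

# Equal frequency: 4 bins, each containing roughly the same number of values
equal_freq = pd.qcut(age, q=4, labels=False)

print(pd.Series(equal_width).value_counts().sort_index())  # equal widths, unequal counts
print(pd.Series(equal_freq).value_counts().sort_index())   # roughly equal counts
```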

15

16 of 61

Discretisation

Supervised

  • Test the hypothesis that two adjacent intervals of a feature are independent of the class
  • If they are independent, they should be merged
  • Otherwise they should remain separate
  • Independence test: the χ² statistic
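
A sketch of the core test only (not a full merging algorithm; the counts are made up): for two adjacent intervals, build a contingency table of interval × class counts and test independence with χ².

```python
from scipy.stats import chi2_contingency

# Counts of (positive, negative) examples falling in two adjacent intervals (toy numbers)
interval_a = [20, 5]
interval_b = [18, 6]

chi2, p_value, dof, expected = chi2_contingency([interval_a, interval_b])

# A high p-value means the class distribution is statistically independent of the
# interval boundary, so the two intervals can be merged.
print(f"chi2={chi2:.3f}, p={p_value:.3f}, merge={p_value > 0.05}")
```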

16

17 of 61

Data collection

17

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

18 of 61

Feature selection

Reducing the N features to a subset containing the best M < N

There are 2^N possible subsets

Solutions

    • Filtering
    • Wrapper

18

19 of 61

Feature selection

Filtering: rank features according to their predictive power and select the best ones

    • Pros
      • Independent of the classifier (performed only once)
    • Cons
      • Independent of the classifier (ignore interaction with the classifier)
      • Assume features are independent

19

20 of 61

Ranking of features

Numerical

    • Correlation coefficient (Pearson; captures only linear relationships!)

    • Mutual information
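
A sketch of filter-style ranking for numerical features, assuming scikit-learn is available (the dataset here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

# Pearson correlation of each feature with the class (captures linear relations only)
pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Mutual information (also captures non-linear dependencies)
mi = mutual_info_classif(X, y, random_state=0)

ranking = np.argsort(-mi)          # feature indices, best first
print("|Pearson r|:", pearson.round(2))
print("Mutual info:", mi.round(2))
print("MI ranking :", ranking)
```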

20

21 of 61

Ranking of features

Categorical

    • χ² method: unlike correlation, the chi-square test only checks whether the class and the feature are independent, without indicating the strength or direction of any existing relationship.
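
A sketch using scikit-learn’s χ² scorer on one-hot encoded categorical features (toy data of my own; chi2 requires non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder

# Toy categorical features [weather, student] and a binary class label
X_cat = np.array([["sunny", "yes"], ["rainy", "no"], ["sunny", "no"],
                  ["overcast", "yes"], ["rainy", "yes"], ["sunny", "yes"]])
y = np.array([1, 0, 0, 1, 0, 1])

X_onehot = OneHotEncoder().fit_transform(X_cat)   # sparse one-hot matrix

# chi2 tests independence between each (one-hot) feature column and the class
chi2_scores, p_values = chi2(X_onehot, y)
print(chi2_scores.round(2), p_values.round(2))
```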

21

22 of 61

Ranking of features

Beware of trusting correlations blindly

22

23 of 61

23

24 of 61

Ranking of features

Collectively relevant features may look individually irrelevant!

24

25 of 61

Feature selection

Wrapper: iteratively add features, using cross-validation to guide feature inclusion, and stop when there is no further improvement (see the sketch after this list)

    • Pros
      • Interact with the classifier
      • No independence assumption
    • Cons
      • Computationally intensive
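
A minimal sketch of a forward-selection wrapper (my own setup; scikit-learn also ships a SequentialFeatureSelector that implements the same idea):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
clf = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:      # stop when there is no improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("Selected features:", selected, "CV accuracy:", round(best_score, 3))
```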

25

26 of 61

Feature selection

Ablation: iteratively remove features, using cross-validation to guide feature removal, and stop when there is no further improvement

    • Pros
      • Interact with the classifier
      • No independence assumption
    • Cons
      • Computationally intensive

26

27 of 61

Example of Feature Count vs. Accuracy

27

28 of 61

Data collection

28

[Flowchart — data collection steps: identification of features; removing irrelevant features; class label available? (N: crowdsourcing labelling); discretisation? (Y: unsupervised/supervised discretisation); normalisation? (Y: standardisation/scaling)]

29 of 61

Feature normalisation

Some classifiers do not handle features with very different scales well

    • Revenue: 10,000,000 (CHF)
    • # of employees: 300

Features with large values dominate the others, and the classifier tends to over-optimise for them

29

30 of 61

Standardisation

x_i' = (x_i − μ_i) / σ_i

where μ_i is the mean value of feature x_i, and σ_i is its standard deviation

The new feature x_i' has mean 0 and standard deviation 1
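
A quick numpy sketch of the formula above, on made-up revenue values (scikit-learn’s StandardScaler does the same):

```python
import numpy as np

x = np.array([10_000_000., 12_000_000., 9_500_000., 11_200_000.])  # e.g., revenue (CHF)

x_std = (x - x.mean()) / x.std()            # x_i' = (x_i - mu_i) / sigma_i
print(x_std.mean().round(6), x_std.std())   # ~0.0 and 1.0
```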

30

31 of 61

Scaling

x_i' = (x_i − m_i) / (M_i − m_i)

where M_i and m_i are the maximum and minimum values of feature x_i, respectively

The new feature x_i' lies in the interval [0, 1]
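
And the corresponding min-max scaling sketch, again on made-up values (MinMaxScaler in scikit-learn):

```python
import numpy as np

x = np.array([300., 20., 150., 75.])             # e.g., number of employees

x_scaled = (x - x.min()) / (x.max() - x.min())   # x_i' lies in [0, 1]
print(x_scaled)                                  # min maps to 0, max to 1
```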

31

32 of 61

Standardisation vs Scaling

Standardisation:

    • Assumes the data is approximately Gaussian-distributed

Scaling:

    • If the data has outliers, they squeeze the “normal” values into a very small interval

32

33 of 61

Classification pipeline

33

Data collection

Model assessment

Model selection

34 of 61

34

35 of 61

Model selection

Usually a classifier has some hyperparameters that need to be tuned

    • Regularisation factor
    • Threshold
    • Distance function
    • Number of neighbours

A performance metric is needed

35

36 of 61

Model selection

36

[Flowchart — model selection loop: split the dataset into “training” and “validation” sets → set the classifier parameters → train the classifier on the training set → evaluate the classifier on the validation set → performance acceptable? (N: adjust the parameters and repeat; Y: done)]
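
A sketch of this loop for a single hyperparameter, the number of neighbours of a k-NN classifier (one of the examples on the previous slide); the dataset and split size are my own choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:                 # set classifier parameters
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)                 # train on the training set
    acc = clf.score(X_val, y_val)             # evaluate on the validation set
    if acc > best_acc:                        # keep the best-performing setting
        best_k, best_acc = k, acc

print(f"best k = {best_k}, validation accuracy = {best_acc:.3f}")
```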

37 of 61

Loss function

Categorical output

    • 0-1 loss function: L(y, ŷ) = 1 if ŷ ≠ y, and 0 otherwise

Real-valued output

    • Squared error: L(y, ŷ) = (y − ŷ)²

    • Absolute error: L(y, ŷ) = |y − ŷ|
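
A short numpy illustration of these three losses on toy predictions of my own:

```python
import numpy as np

# Categorical output: 0-1 loss
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
zero_one = np.mean(y_true != y_pred)          # fraction of misclassified examples

# Real-valued output: squared and absolute error
t = np.array([3.0, -0.5, 2.0])
p = np.array([2.5,  0.0, 2.0])
squared  = np.mean((t - p) ** 2)
absolute = np.mean(np.abs(t - p))

print(zero_one, squared, absolute)            # 0.5, ~0.167, ~0.333
```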

37

38 of 61

Model selection

38

[Plot: loss function (y-axis) versus parameter value (x-axis)]

39 of 61

Performance metric for Binary Classification

For categorical binary classification, the usual metrics are built from four possible outcomes

    • True Positive (positive examples classified as positive)
    • True Negative (negative examples classified as negative)
    • False Positive (negative examples classified as positive)
    • False Negative (positive examples classified as negative)

39

                     Class A    Class B
Classified as A         TP         FP
Classified as B         FN         TN

40 of 61

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN) is an appropriate metric when

    • Classes are not skewed
    • Errors have the same importance

40

41 of 61

Accuracy (skewed example)

Classifier 1 (A = 85/100 = 0.85)

                        Class Fraud    Class ¬Fraud
Classified Fraud             5              10
Classified ¬Fraud            5              80

Always ¬Fraud (A = 90/100 = 0.90)

                        Class Fraud    Class ¬Fraud
Classified Fraud             0               0
Classified ¬Fraud           10              90

42 of 61

Question time

Which is the “best” classifier?

            • Classifier 1
            • Classifier 2
            • Both are equally good

42

Classifier 1

                   Class A    Class B
Classified as A       45         20
Classified as B        5         30

Classifier 2

                   Class A    Class B
Classified as A       40         10
Classified as B       10         40

43 of 61

Question time

Which is the “best” classifier?

            • Classifier 1
            • Classifier 2
            • Both are equally good

43

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

44 of 61

Precision and recall

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

44

45 of 61

Precision and recall

P1 = 45/65 = 0.69    P2 = 40/50 = 0.80

R1 = 45/50 = 0.90    R2 = 40/50 = 0.80

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

Everybody has cancer

                        Class Cancer    Class ¬Cancer
Classified Cancer            50              50
Classified ¬Cancer            0               0

P = 50/100 = 0.5    R = 50/50 = 1
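
A small check of the numbers above, computing precision and recall directly from the three confusion matrices on this slide:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# (TP, FP, FN) read off the confusion matrices above
classifiers = {
    "Classifier 1":         (45, 20, 5),
    "Classifier 2":         (40, 10, 10),
    "Everybody has cancer": (50, 50, 0),
}

for name, (tp, fp, fn) in classifiers.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name}: P = {p:.2f}, R = {r:.2f}")
# Classifier 1: P = 0.69, R = 0.90
# Classifier 2: P = 0.80, R = 0.80
# Everybody has cancer: P = 0.50, R = 1.00
```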

46 of 61

F-score

Sometimes it is necessary to have a single metric to compare classifiers

F-score (or F1 score): F1 = 2 · P · R / (P + R), the harmonic mean of precision and recall

Precision and recall can be weighted differently, if one is more important than the other

46

47 of 61

Precision and recall

F1 = 2 · (0.69 · 0.9) / (0.69 + 0.9) = 0.78    F2 = 2 · (0.8 · 0.8) / (0.8 + 0.8) = 0.80

Classifier 1

                        Class Cancer    Class ¬Cancer
Classified Cancer            45              20
Classified ¬Cancer            5              30

Classifier 2

                        Class Cancer    Class ¬Cancer
Classified Cancer            40              10
Classified ¬Cancer           10              40

Everybody has cancer

                        Class Cancer    Class ¬Cancer
Classified Cancer            50              50
Classified ¬Cancer            0               0

F = 2 · (0.5 · 1) / (0.5 + 1) = 0.66

48 of 61

ROC plots

ROC stands for Receiver Operating Characteristic. A ROC plot has:

Y-axis: true positive rate = tp/(tp + fn), same as recall

X-axis: false positive rate = fp/(fp + tn) = 1 - specificity

48

[Plot: example ROC curve, annotated with the direction of increasing score]

49 of 61

ROC AUC

ROC AUC is the “Area Under the Curve” – a single number that captures the overall quality of the classifier. It should be between 0.5 (random classifier) and 1.0 (perfect).
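
A sketch with scikit-learn, using a classifier’s predicted scores on a synthetic dataset (the dataset and classifier are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # points of the ROC curve
auc = roc_auc_score(y_te, scores)                # area under that curve
print(f"ROC AUC = {auc:.3f}")                    # 0.5 = random, 1.0 = perfect
```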

49

[Plot: ROC curve of a random ordering — the diagonal, with area = 0.5]

50 of 61

Training and test set

50

[Diagram: database D is split into a training set (60% of D) and a test set (40% of D); the model is learned on the training set and evaluated on the test set, producing the performance metric]
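
The same 60/40 split with scikit-learn (the dataset and classifier below are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 60% of D for training, 40% of D for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learn model
print("test accuracy:", round(model.score(X_test, y_test), 3))        # evaluate model
```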

51 of 61

Leave-one-out cross-validation

51

[Diagram: database D is split into a training set ((N−1)/N of D) and a test set (1/N of D); the model is learned and evaluated; this is repeated N times and the performance metric is averaged over the N runs]
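
Leave-one-out with scikit-learn (same assumed dataset and classifier as in the previous sketch):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# N folds: each run trains on N-1 examples and tests on the single held-out one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print("LOO accuracy (average of N runs):", round(scores.mean(), 3))
```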

52 of 61

K-fold cross validation

52

[Diagram: database D is split into a training set ((k−1)/k of D) and a test set (1/k of D); the model is learned and evaluated; this is repeated k times and the performance metric is averaged over the k runs]
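
And k-fold cross-validation, with k = 5 as an assumed choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("accuracy per fold:", scores.round(3))
print("average of k runs:", round(scores.mean(), 3))
```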

53 of 61

Bias and variance

Bias and variance can be assessed by comparing the error metric on the training set and the test set => always plot learning curves

In the ideal case, we want low bias (small training error) and low variance (test error close to the training error)
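
A sketch of how the learning curves mentioned above can be computed with scikit-learn (plotting left out; the dataset and classifier are assumed):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Training and validation scores for increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

train_err = 1 - train_scores.mean(axis=1)   # high bias if this stays large
val_err   = 1 - val_scores.mean(axis=1)     # high variance if the gap to train_err is large
for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n={n:4d}  train error={tr:.3f}  validation error={va:.3f}")
```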

53

54 of 61

Bias and variance

54

55 of 61

When more data helps

55

[Plots: learning curves for the high-bias case and the high-variance case — adding more data helps mainly in the high-variance case]

56 of 61

When more data helps

56

Data > Algorithms

(take it with a grain of salt…:-)

57 of 61

Classification pipeline

©2016, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis

57

Data collection

Model assessment

Model selection

58 of 61

Model assessment

Model assessment is the task of estimating the classification accuracy of a fixed model (i.e., the best model found during model selection)

Evaluate the model on the whole dataset using:

    • Training and test
    • Leave-one-out
    • K-fold cross-validation

58

59 of 61

Important Reads

59

60 of 61

60

Check these slides!

61 of 61

Sources

  • Aberer / Catasta - DIS 2016
  • CS194 Berkeley University
  • Tristan Penman - PWL @ Seattle

61