1 of 54


Core Concepts in Machine Learning 1

Nikhil Bhagwat

The Neuro, McGill University, Montreal, QC, Canada

nikhil.bhagwat@mcgill.ca

ABCD-ReproNim

ABCD-ReproNim: An ABCD Course on Reproducible Data Analyses

2 of 54


Learning Objectives of this Lecture

  • Define machine-learning nomenclature
  • Describe basics of the “learning” process
  • Explain model design choices and performance trade-offs
  • Introduce model selection and validation frameworks
  • Explain model performance metrics



5 of 54


Pop Quiz

Say we currently have a population with 1% COVID prevalence. We train a simple machine-learning model to identify COVID patients using their biometry.

Our model is 91% accurate! Then we also calculate,

  • 90% sensitivity (i.e. probability that prediction is positive if patient has COVID)
  • 91% specificity (i.e. probability that prediction is negative if patient doesn’t have COVID)

What are my chances that I have COVID, if my test is positive?

A) 9 in 10   B) 1 in 2   C) 1 in 10   D) 1 in 100

Later we train a fancy deep-learning model to identify COVID patients using their chest CT! This model has an accuracy of 99%! We calculate:

  • 80% sensitivity
  • 99% specificity

Which model is better?

A) Simple   B) Fancy


9 of 54


[Diagram: input X → ML Model → output Y]

Training a machine-learning model: this was the easy part! The harder questions:

  • Which features?
  • Which model?
  • Which performance metrics?
  • How do I validate my design choices?


12 of 54


Machine-learning - what, why, and when?

  • What is Machine learning (ML)?
    • ML is the study of computer algorithms that improve automatically through experience and by the use of data.

  • Why is it useful - especially in life sciences?
    • Biology, medicine, and environmental sciences deal with phenomena (e.g. a disease) involving a large number of variables.
    • We want to model complex relationships within these variables and make accurate predictions.

  • When do I use it?
    • You are interested in 1) prediction tasks or 2) low-dimensional representations.
    • You have sufficient data.

13 of 54


Terminology

[Diagram: an input feature matrix X (n samples × p features) maps to output labels Y (n samples) through a Model (M). Output examples: clinical measures.]

14 of 54


Types of ML Algorithms

Outcome      | Supervised Learning | Unsupervised Learning
-------------|---------------------|--------------------------
Continuous   | Regression          | Dimensionality reduction
Categorical  | Classification      | Clustering



20 of 54


  • Goal: Learn parameters (or weights) of a model (M) that maps X to y
  • Example models:
    • Linear / Logistic regression
    • Support vector machines
    • Tree-ensembles: random forests, gradient boosting
    • Artificial Neural networks

[Illustrations: linear regression, SVM, tree-ensembles, ANN]

Supervised Learning: Models

21 of 54


Model Fitting

  • How do we learn the model weights?
    • Example: Linear regression
    • Model: y = β0 + β1x1 + β2x2
    • Loss function: MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
    • Optimization: Gradient descent

[Figure: MSE loss surface as a function of β1 and β2]
22 of 54


Model Fitting

  • Gradient descent with a single input variable and n samples
    • Start with random weights (β0 and β1)
    • Compute loss (i.e. MSE)
    • Update weights based on the gradient

  • Prediction: ŷᵢ = β0 + β1xᵢ
  • Loss: MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

[Figure: MSE as a function of β0 and of β1, with gradient-descent steps moving toward the minimum]
23 of 54


Model Fitting

    • Gradient descent for complex models with non-convex loss functions
      • Start with random weights
      • Compute loss
      • Update weights based on the gradient

[Figure: a non-convex loss surface for more complex models / loss functions (e.g. ANNs), with both a local minimum and the global minimum]


26 of 54


Model Fitting

    • Can we control this fitting process to get a model with specific characteristics?
      • We have strong prior beliefs about what a plausible model is
        • e.g. I believe this symptom can be predicted with a handful of genes.
      • Practical reasons
        • Prevent overfitting (n_features >> n_samples), e.g. y = β0 + β1x1 + β2x2 + … + βp-1xp-1 + βpxp with p >> n

    • Yes! → Model regularization

27 of 54


Model Fitting: Regularization

    • How do we do it?
      • Modify the loss function
      • Constrain the learning process

    • Examples:
      • L1 i.e. Lasso: constrains parameters to be sparse
      • L2 i.e. Ridge: constrains parameters to be small

L1 (Lasso): MSE = Σᵢ₌₁ⁿ (yᵢ − [β0 + Σⱼ₌₁ᵖ xᵢⱼβⱼ])² + 𝝀 Σⱼ₌₁ᵖ |βⱼ|

L2 (Ridge): MSE = Σᵢ₌₁ⁿ (yᵢ − [β0 + Σⱼ₌₁ᵖ xᵢⱼβⱼ])² + 𝝀 Σⱼ₌₁ᵖ βⱼ²

(the bracketed term is the prediction ŷᵢ)
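A small scikit-learn sketch (illustrative data; alpha plays the role of 𝝀) showing the practical difference: with the same penalty strength, L1 tends to zero out coefficients while L2 only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# toy data: only the first 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(lasso.coef_)   # sparse: most coefficients are exactly 0
print(ridge.coef_)   # shrunk: small but mostly non-zero coefficients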

28 of 54

Model Fitting: Scikit-learn syntax

# import
from sklearn import linear_model, svm

# data
X = [[0, 0], [1, 1]]
y = [0, 1]

# pick a model
model = linear_model.Lasso(alpha=0.1)  # or: model = svm.SVC()

# fit the model with data
model.fit(X, y)

# predict on new data
y_pred = model.predict([[1, 0]])

29 of 54


Model Evaluation

    • Is the model generalizable?
    • How do we sample train and test sets?
    • How do we select a model?

[Diagram: Data (N samples) is split into a Train set (~90% of samples) used for Model Fitting and a Test set (~10% of samples); the Trained model is then Evaluated on the Test set]



32 of 54


Model Evaluation

[Figure: regression fits on the train set and test set, illustrating overfitting, an optimal fit, and underfitting]

    • Train performance ≠ Test performance
      • Model: Underfitting vs Overfitting
      • Errors: Bias - Variance tradeoff
      • Regression example
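One way to reproduce this behaviour numerically (a sketch with made-up data, not the figure above): fit polynomials of increasing degree and compare train vs test error; a low degree underfits, a very high degree overfits.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=(40, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:                      # underfit, ~optimal, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))   # keeps shrinking with degree
    test_err = mean_squared_error(y_te, model.predict(X_te))    # rises again when overfitting
    print(degree, train_err, test_err)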


34 of 54


Model Evaluation

[Figure: 2-D classification example (features X1, X2) with train and test samples from two classes, showing overfitting, optimal, and underfitting decision boundaries]

    • Train performance ≠ Test performance
      • Model: Underfitting vs Overfitting
      • Errors: Bias - Variance tradeoff
      • Classification example



37 of 54



    • How do we sample train and test sets?
      • Train set: learn model parameters
      • Test set (a.k.a held-out sample): Evaluate model performance
      • Repeat for different Train-Test splits
        • k-fold, shuffle-split
      • Report performance statistics over all test folds

[Figure: 5-fold cross-validation (CV outer loop): the data are split into 5 folds; in each fold a different fifth is held out as test data and the remaining data are used for training]

Model Evaluation: Cross-Validation (Outer loop)
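A scikit-learn sketch of the outer loop (illustrative; the toy dataset stands in for your real X and y):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)   # toy data
cv = KFold(n_splits=5, shuffle=True, random_state=0)                       # outer-loop splits
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")           # one score per test fold
print(scores.mean(), scores.std())                                         # statistics over all test folds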


39 of 54


[Figure: CV inner loop: within the train data of each outer split, four further splits each hold out one fold as validation data and train on the rest; the outer test data is left untouched]

    • How do we select a model?
      • Tune hyper-parameters of a model
      • Compare several different model architectures
      • Select / transform raw features

    • This repeats for all train-test splits in the outer loop


Model Evaluation: Cross-Validation (Inner loop)
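A sketch of the inner loop nested inside the outer loop (illustrative; the param_grid values are arbitrary): GridSearchCV tunes a hyper-parameter on inner train/validation splits, and cross_val_score evaluates the tuned model on the outer test folds.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10]}                    # hyper-parameter values to try
inner = GridSearchCV(SVC(), param_grid, cv=4)       # inner loop: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)   # outer loop: generalization estimate
print(outer_scores.mean())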


42 of 54


    • Hyper-parameter ≠ parameter (or weights)
      • Parameters are learned; hyper-parameters are chosen!

    • Examples:
      • Degree of model (e.g. linear vs quadratic)
      • Kernels
      • Number of trees
      • Number of layers, filters, batch-size, learning-rate in ANNs

    • How do we choose them?
      • Prior beliefs → e.g. cortical thickness and age have a quadratic relationship.
      • Arbitrarily → we gotta start with something!
      • Trial and error → do a computationally feasible grid-search.

Model Evaluation: Hyper-parameters

43 of 54


Performance Scores

    • Loss functions → computationally well-suited metrics
      • May not completely capture the performance metrics of interest

    • Scores → practically useful metrics
      • Binary classification

Confusion matrix:

                        Ground Truth: POSITIVE   Ground Truth: NEGATIVE
Prediction: POSITIVE    TP                       FP (False Positive)
Prediction: NEGATIVE    FN (False Negative)      TN

44 of 54


Performance Scores

    • ML model that detects Covid from chest CTs. Current Covid prevalence ~ 1%.
      • FP: model predicts Covid when person is healthy
      • FN: model predicts healthy when person has Covid
    • What happens if we build a model that predicts everyone as healthy?
      • i.e. zero FPs!

45 of 54


Performance Scores

(Null = the score of the null model above, which predicts everyone as healthy, at ~1% prevalence)

Accuracy
  Formula: (TP+TN) / (TP+FP+FN+TN)   |   Null: 0.99
  How many people did we correctly predict out of all the people scanned?
  Use it when FNs & FPs have similar costs.

Precision (i.e. PPV)
  Formula: TP / (TP+FP)   |   Null: NaN
  How many of those who we predicted as "covid" actually have "covid"?
  Use it if you want to be more confident of your TPs.

Recall (a.k.a. Sensitivity)
  Formula: TP / (TP+FN)   |   Null: 0
  Of all the people who have covid, how many did we correctly predict?
  Use it if you prefer FPs over FNs.

Specificity
  Formula: TN / (TN+FP)   |   Null: 1
  Of all the people who are healthy, how many did we correctly predict?
  Use it if you prefer FNs over FPs.

F1
  Formula: 2 × (Recall × Precision) / (Recall + Precision)   |   Null: NaN
  Harmonic mean (average) of precision and recall.
  Use it when you have an uneven class distribution.

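A scikit-learn sketch computing these scores from labels (illustrative arrays; 1 = covid, 0 = healthy):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]       # ground truth
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]       # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))          # (TP+TN) / all
print(precision_score(y_true, y_pred))         # TP / (TP+FP)
print(recall_score(y_true, y_pred))            # TP / (TP+FN), i.e. sensitivity
print(tn / (tn + fp))                          # TN / (TN+FP), i.e. specificity
print(f1_score(y_true, y_pred))                # harmonic mean of precision and recall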

46 of 54


Pop Quiz Answers

We train a simple machine-learning model to identify COVID patients using their biometry, in a population with 1% COVID prevalence. Our model is 91% accurate! Then we also calculate,

  • 90% sensitivity (i.e. probability that prediction is positive if patient has COVID)
  • 91% specificity (i.e. probability that prediction is negative if patient doesn’t have COVID)

What are my chances that I have COVID if my test is positive?

(Imagine a sample of 1000 individuals → 10 COVID patients → 9 TP & 89 FP)

A) 9 in 10   B) 1 in 2   C) 1 in 10   D) 1 in 100
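Worked out with the hint above: TP ≈ 0.90 × 10 = 9 and FP ≈ (1 − 0.91) × 990 ≈ 89, so PPV = 9 / (9 + 89) ≈ 0.09, i.e. about 1 in 10 (answer C).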

Later we train a fancy deep-learning model to identify COVID patients using their chest CT! This model has an accuracy of 99%! We calculate:

  • 80% sensitivity
  • 99% specificity

Which model is better? (We want to avoid FNs to reduce the spread → we want high sensitivity)

A) Simple   B) Fancy   → Answer: A) Simple (higher sensitivity: 90% vs 80%)

47 of 54


Performance Curves

    • Receiver Operating Characteristic (ROC) → Want high area-under-the-curve (AUC)
    • Precision-Recall → Want high AUC or high Average precision (AP)
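A sketch for computing these curves with scikit-learn (illustrative labels; y_score would be your model's predicted probability for the positive class, e.g. model.predict_proba(X)[:, 1]):

from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)               # ROC: TPR vs FPR over thresholds
print(roc_auc_score(y_true, y_score))                           # area under the ROC curve (AUC)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))                 # average precision (AP)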

48 of 54


Deep-learning

    • Why the buzz?
      • Works amazingly well on structured input
      • Highly flexible → universal function approximator

    • What are the challenges?
      • Large number of parameters → data hungry
      • Large number of hyper-parameters → difficult to train

    • When do I use it?
      • If you have highly-structured input, e.g. medical images.
      • You have a lot of data and computational resources.

[Animation: ANN for handwritten-digit images (gif source: 3b1b)]


50 of 54


Pitfalls and Challenges

    • Models do not generalize even after good CV performance
      • Implicit double-dipping
      • Dataset biases (e.g. North-American demographics)
      • Noisy labels (e.g. diagnosis definitions)
      • Data distribution shifts (e.g. assay, scanner upgrades)

    • Unnecessary complexity
      • Do I really need a giant deep net, or would a simple linear model do?


52 of 54


ML Novice Checklist

    • Data
      • What is my n_features and n_samples?
      • Am I encoding categorical data correctly?
      • Am I using information (e.g. the mean) from the test set to preprocess (e.g. z-score) the data?

    • Model
      • Do my performance metrics capture the practical use-case of interest?
      • What is the null / dummy model performance? (see the sketch below)
        • Classification: Predict the majority class all the time
        • Regression: Predict the median value all the time
      • Am I interpreting model parameters (i.e. weights) correctly?
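A sketch of such null / dummy baselines with scikit-learn (illustrative data; DummyClassifier and DummyRegressor just apply the constant strategies described above):

from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor

Xc, yc = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)
null_clf = DummyClassifier(strategy="most_frequent").fit(Xc, yc)
print(null_clf.score(Xc, yc))        # ~0.9 accuracy just by predicting the majority class

Xr, yr = make_regression(n_samples=100, random_state=0)
null_reg = DummyRegressor(strategy="median").fit(Xr, yr)
print(null_reg.score(Xr, yr))        # R² around (or below) 0 for a constant prediction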

53 of 54


    • Supervised models are useful for predictions
      • e.g. image segmentation, prognosis, drug development

    • Our job is to ensure generalizability of these models
      • Multitude of validations
      • Understanding model biases and limitations

    • Food for thought: engineering tools vs scientific discovery
      • Interpretability and explainability
      • Causality, reliability, fairness

Takeaways

[Cartoon: "It's Covid because…" (Explainable AI)]
