1 of 54


Core Concepts in Machine Learning 1

Nikhil Bhagwat

The Neuro, McGill University, Montreal, QC, Canada

nikhil.bhagwat@mcgill.ca

ABCD-ReproNim

ABCD-ReproNim: An ABCD Course on Reproducible Data Analyses

2 of 54


Learning Objectives of this Lecture

  • Define machine-learning nomenclature
  • Describe basics of the “learning” process
  • Explain model design choices and performance trade-offs
  • Introduce model selection and validation frameworks
  • Explain model performance metrics



5 of 54


Pop Quiz

Say we currently have a population with 1% COVID prevalence. We train a simple machine-learning model to identify COVID patients using their biometry.

Our model is 91% accurate! Then we also calculate,

  • 90% sensitivity (i.e. probability that prediction is positive if patient has COVID)
  • 91% specificity (i.e. probability that prediction is negative if patient doesn’t have COVID)

What are my chances that I have COVID, if my test is positive?

A) 9 in 10   B) 1 in 2   C) 1 in 10   D) 1 in 100

Later we train a fancy deep-learning model to identify COVID patients using their chest CT! This model has an accuracy of 99%! We calculate:

  • 80% sensitivity
  • 99% specificity

Which model is better?

A) Simple   B) Fancy


9 of 54


[Diagram: input X → ML Model → output Y]

Training a machine-learning model: this was the easy part! The harder questions:

  • Which features?
  • Which model?
  • Which performance metrics?
  • How do I validate my design choices?


12 of 54


Machine-learning - what, why, and when?

  • What is Machine learning (ML)?
    • ML is the study of computer algorithms that improve automatically through experience and by the use of data.

  • Why is it useful - especially in life sciences?
    • Biology, medicine, and environmental sciences deal with phenomena (e.g. a disease) involving a large number of variables.
    • We want to model complex relationships within these variables and make accurate predictions.

  • When do I use it?
    • You are interested in 1) prediction tasks or 2) low-dimensional representations.
    • You have sufficient data.

13 of 54


Terminology

[Diagram: an input feature matrix X (n samples × p features) maps to output labels Y (n samples) through a Model (M). Output examples: clinical measures.]

14 of 54


Types of ML Algorithms

Outcome      | Supervised Learning | Unsupervised Learning
-------------|---------------------|--------------------------
Continuous   | Regression          | Dimensionality reduction
Categorical  | Classification      | Clustering



20 of 54


  • Goal: Learn parameters (or weights) of a model (M) that maps X to y
  • Example models:
    • Linear / Logistic regression
    • Support vector machines
    • Tree-ensembles: random forests, gradient boosting
    • Artificial Neural networks

[Illustrations: linear regression, SVM, tree-ensembles, ANN]

Supervised Learning: Models

21 of 54


Model Fitting

  • How do we learn the model weights?
    • Example: Linear regression
    • Model: y = β0 + β1x1 + β2x2
    • Loss function: MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
    • Optimization: Gradient descent

[Figure: MSE loss surface as a function of β1 and β2]
22 of 54


Model Fitting

  • Gradient descent with a single input variable and n samples
    • Start with random weights (β0 and β1)
    • Compute loss (i.e. MSE)
    • Update weights based on the gradient

  • Prediction: ŷᵢ = β0 + β1xᵢ
  • Loss: MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

[Figure: MSE as a function of β0 and of β1, with gradient-descent steps moving toward the minimum]
23 of 54


Model Fitting

    • Gradient descent for complex models with non-convex loss functions
      • Start with random weights
      • Compute loss
      • Update weights based on the gradient

[Figure: a non-convex loss surface for more complex models / loss functions (e.g. ANNs), with both a local minimum and the global minimum]


26 of 54


Model Fitting

    • Can we control this fitting process to get a model with specific characteristics?
      • We have strong prior beliefs about what a plausible model is
        • e.g. I believe this symptom can be predicted with a handful of genes.
      • Practical reasons
        • Prevent overfitting (n_features >> n_samples), e.g. y = β0 + β1x1 + β2x2 + … + βp-1xp-1 + βpxp with p >> n

    • Yes! → Model regularization

27 of 54


Model Fitting: Regularization

    • How do we do it?
      • Modify the loss function
      • Constrain the learning process

    • Examples:
      • L1 i.e. Lasso: constrains parameters to be sparse
      • L2 i.e. Ridge: constrains parameters to be small

L1 (Lasso): MSE = Σᵢ₌₁ⁿ (yᵢ − [β0 + Σⱼ₌₁ᵖ xᵢⱼβⱼ])² + 𝝀 Σⱼ₌₁ᵖ |βⱼ|

L2 (Ridge): MSE = Σᵢ₌₁ⁿ (yᵢ − [β0 + Σⱼ₌₁ᵖ xᵢⱼβⱼ])² + 𝝀 Σⱼ₌₁ᵖ βⱼ²

(the bracketed term is the prediction ŷᵢ)
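A small scikit-learn sketch (illustrative data; alpha plays the role of 𝝀) showing the practical difference: with the same penalty strength, L1 tends to zero out coefficients while L2 only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# toy data: only the first 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(lasso.coef_)   # sparse: most coefficients are exactly 0
print(ridge.coef_)   # shrunk: small but mostly non-zero coefficients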

28 of 54

Model Fitting: Scikit-learn syntax

# import
from sklearn import linear_model, svm

# data
X = [[0, 0], [1, 1]]
y = [0, 1]

# pick a model
model = linear_model.Lasso(alpha=0.1)  # or: model = svm.SVC()

# fit the model with data
model.fit(X, y)

# predict on new data
y_pred = model.predict([[1, 0]])

29 of 54


Model Evaluation

    • Is the model generalizable?
    • How do we sample train and test sets?
    • How do we select a model?

[Diagram: Data (N samples) is split into a Train set (~90% of samples) used for Model Fitting and a Test set (~10% of samples); the Trained model is then Evaluated on the Test set]



32 of 54


Model Evaluation

[Figure: regression fits on the train set and test set, illustrating overfitting, an optimal fit, and underfitting]

    • Train performance ≠ Test performance
      • Model: Underfitting vs Overfitting
      • Errors: Bias - Variance tradeoff
      • Regression example
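One way to reproduce this behaviour numerically (a sketch with made-up data, not the figure above): fit polynomials of increasing degree and compare train vs test error; a low degree underfits, a very high degree overfits.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=(40, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:                      # underfit, ~optimal, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))   # keeps shrinking with degree
    test_err = mean_squared_error(y_te, model.predict(X_te))    # rises again when overfitting
    print(degree, train_err, test_err)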


34 of 54


Model Evaluation

[Figure: 2-D classification example (features X1, X2) with train and test samples from two classes, showing overfitting, optimal, and underfitting decision boundaries]

    • Train performance ≠ Test performance
      • Model: Underfitting vs Overfitting
      • Errors: Bias - Variance tradeoff
      • Classification example



37 of 54



    • How do we sample train and test sets?
      • Train set: learn model parameters
      • Test set (a.k.a held-out sample): Evaluate model performance
      • Repeat for different Train-Test splits
        • k-fold, shuffle-split
      • Report performance statistics over all test folds

[Figure: 5-fold cross-validation (CV outer loop): the data are split into 5 folds; in each fold a different fifth is held out as test data and the remaining data are used for training]

Model Evaluation: Cross-Validation (Outer loop)
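A scikit-learn sketch of the outer loop (illustrative; the toy dataset stands in for your real X and y):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)   # toy data
cv = KFold(n_splits=5, shuffle=True, random_state=0)                       # outer-loop splits
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")           # one score per test fold
print(scores.mean(), scores.std())                                         # statistics over all test folds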


39 of 54


[Figure: CV inner loop: within the train data of each outer split, four further splits each hold out one fold as validation data and train on the rest; the outer test data is left untouched]

    • How do we select a model?
      • Tune hyper-parameters of a model
      • Compare several different model architectures
      • Select / transform raw features

    • This repeats for all train-test splits in the outer loop


Model Evaluation: Cross-Validation (Inner loop)
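A sketch of the inner loop nested inside the outer loop (illustrative; the param_grid values are arbitrary): GridSearchCV tunes a hyper-parameter on inner train/validation splits, and cross_val_score evaluates the tuned model on the outer test folds.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10]}                    # hyper-parameter values to try
inner = GridSearchCV(SVC(), param_grid, cv=4)       # inner loop: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)   # outer loop: generalization estimate
print(outer_scores.mean())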


42 of 54


    • Hyper-parameter ≠ parameter (or weights)
      • Parameters are learned; hyper-parameters are chosen!

    • Examples:
      • Degree of model (e.g. linear vs quadratic)
      • Kernels
      • Number of trees
      • Number of layers, filters, batch-size, learning-rate in ANNs

    • How do we choose them?
      • Prior beliefs → e.g. cortical thickness and age have a quadratic relationship.
      • Arbitrarily → we gotta start with something!
      • Trial and error → do a computationally feasible grid-search.

Model Evaluation: Hyper-parameters

43 of 54


Performance Scores

    • Loss functions → computationally well-suited metrics
      • May not completely capture the performance metrics of interest

    • Scores → practically useful metrics
      • Binary classification

Confusion matrix:

                        Ground Truth: POSITIVE   Ground Truth: NEGATIVE
Prediction: POSITIVE    TP                       FP (False Positive)
Prediction: NEGATIVE    FN (False Negative)      TN

44 of 54


Performance Scores

    • ML model that detects Covid from chest CTs. Current Covid prevalence ~ 1%.
      • FP: model predicts Covid when person is healthy
      • FN: model predicts healthy when person has Covid
    • What happens if we build a model that predicts everyone as healthy?
      • i.e. zero FPs!

45 of 54


Performance Scores

(Null = the score of the null model above, which predicts everyone as healthy, at ~1% prevalence)

Accuracy
  Formula: (TP+TN) / (TP+FP+FN+TN)   |   Null: 0.99
  How many people did we correctly predict out of all the people scanned?
  Use it when FNs & FPs have similar costs.

Precision (i.e. PPV)
  Formula: TP / (TP+FP)   |   Null: NaN
  How many of those who we predicted as "covid" actually have "covid"?
  Use it if you want to be more confident of your TPs.

Recall (a.k.a. Sensitivity)
  Formula: TP / (TP+FN)   |   Null: 0
  Of all the people who have covid, how many did we correctly predict?
  Use it if you prefer FPs over FNs.

Specificity
  Formula: TN / (TN+FP)   |   Null: 1
  Of all the people who are healthy, how many did we correctly predict?
  Use it if you prefer FNs over FPs.

F1
  Formula: 2 × (Recall × Precision) / (Recall + Precision)   |   Null: NaN
  Harmonic mean (average) of precision and recall.
  Use it when you have an uneven class distribution.

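A scikit-learn sketch computing these scores from labels (illustrative arrays; 1 = covid, 0 = healthy):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]       # ground truth
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]       # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))          # (TP+TN) / all
print(precision_score(y_true, y_pred))         # TP / (TP+FP)
print(recall_score(y_true, y_pred))            # TP / (TP+FN), i.e. sensitivity
print(tn / (tn + fp))                          # TN / (TN+FP), i.e. specificity
print(f1_score(y_true, y_pred))                # harmonic mean of precision and recall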

46 of 54


Pop Quiz Answers

We train a simple machine-learning model to identify COVID patients using their biometry, in a population with 1% COVID prevalence. Our model is 91% accurate! Then we also calculate,

  • 90% sensitivity (i.e. probability that prediction is positive if patient has COVID)
  • 91% specificity (i.e. probability that prediction is negative if patient doesn’t have COVID)

What are my chances that I have COVID if my test is positive?

(Imagine a sample of 1000 individuals → 10 COVID patients → 9 TP & 89 FP)

A) 9 in 10   B) 1 in 2   C) 1 in 10   D) 1 in 100
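Worked out with the hint above: TP ≈ 0.90 × 10 = 9 and FP ≈ (1 − 0.91) × 990 ≈ 89, so PPV = 9 / (9 + 89) ≈ 0.09, i.e. about 1 in 10 (answer C).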

Later we train a fancy deep-learning model to identify COVID patients using their chest CT! This model has an accuracy of 99%! We calculate:

  • 80% sensitivity
  • 99% specificity

Which model is better? (We want to avoid FNs to reduce the spread → we want high sensitivity)

A) Simple   B) Fancy   → Answer: A) Simple (higher sensitivity: 90% vs 80%)

47 of 54


Performance Curves

    • Receiver Operating Characteristic (ROC) → Want high area-under-the-curve (AUC)
    • Precision-Recall → Want high AUC or high Average precision (AP)
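A sketch for computing these curves with scikit-learn (illustrative labels; y_score would be your model's predicted probability for the positive class, e.g. model.predict_proba(X)[:, 1]):

from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)               # ROC: TPR vs FPR over thresholds
print(roc_auc_score(y_true, y_score))                           # area under the ROC curve (AUC)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))                 # average precision (AP)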

48 of 54


Deep-learning

    • Why the buzz?
      • Works amazingly well on structured input
      • Highly flexible → universal function approximator

    • What are the challenges?
      • Large number of parameters → data hungry
      • Large number of hyper-parameters → difficult to train

    • When do I use it?
      • If you have highly-structured input, e.g. medical images.
      • You have a lot of data and computational resources.

[Animation: ANN for handwritten-digit images (gif source: 3b1b)]


50 of 54


Pitfalls and Challenges

    • Models do not generalize even after good CV performance
      • Implicit double-dipping
      • Dataset biases (e.g. North-American demographics)
      • Noisy labels (e.g. diagnosis definitions)
      • Data distribution shifts (e.g. assay, scanner upgrades)

    • Unnecessary complexity
      • Do I really need a giant deep net, or would a simple linear model do?


52 of 54


ML Novice Checklist

    • Data
      • What is my n_features and n_samples?
      • Am I encoding categorical data correctly?
      • Am I using information (e.g. the mean) from the test set to preprocess (e.g. z-score) the data?

    • Model
      • Do my performance metrics capture the practical use-case of interest?
      • What is the null / dummy model performance? (see the sketch below)
        • Classification: Predict the majority class all the time
        • Regression: Predict the median value all the time
      • Am I interpreting model parameters (i.e. weights) correctly?
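A sketch of such null / dummy baselines with scikit-learn (illustrative data; DummyClassifier and DummyRegressor just apply the constant strategies described above):

from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor

Xc, yc = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)
null_clf = DummyClassifier(strategy="most_frequent").fit(Xc, yc)
print(null_clf.score(Xc, yc))        # ~0.9 accuracy just by predicting the majority class

Xr, yr = make_regression(n_samples=100, random_state=0)
null_reg = DummyRegressor(strategy="median").fit(Xr, yr)
print(null_reg.score(Xr, yr))        # R² around (or below) 0 for a constant prediction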

53 of 54


    • Supervised models are useful for predictions
      • e.g. image segmentation, prognosis, drug development

    • Our job is to ensure generalizability of these models
      • Multitude of validations
      • Understanding model biases and limitations

    • Food for thought: engineering tools vs scientific discovery
      • Interpretability and explainability
      • Causality, reliability, fairness

Takeaways

[Cartoon: "It's Covid because…" (Explainable AI)]
