A brief intro to machine learning
BIOM285
May 21, 2025
Machine learning
What is machine learning? Building models that learn inherent properties of data
What are the models ‘learning’? Patterns that are generalizable in order to gain insight into the data and/or make predictions about new data
Machine learning is not the same as statistics - statistics performs inference about a population from a sample, whereas machine learning learns patterns in order to make predictions. Machine learning is a type of artificial intelligence
Machine learning algorithms
Many different options for machine learning
Machine learning
In machine learning you build a model from some input (i.e. training) data, and then use that model to predict the outcome of new data
Some types of machine learning questions:
Classification - predict which of several pre-defined categories a data point belongs to
Regression - predict a specific quantity for a data point
Clustering - assign data to categories defined from the data itself
Dimension reduction - determine the high-level features that summarize the data
Machine learning
In machine learning you build a model from input (i.e. training) data, and can then use that model to predict the outcome of new data
Some types of machine learning:
Supervised - pre-specify the outcomes (labels) of the training data, and then learn patterns that separate the data - can be categorical (classification) or continuous (regression)
Unsupervised - no pre-specification of outcomes for the training data, learn patterns that describe data as groups (e.g. clusters) or quantities (e.g. dimension reduction)
Semi-supervised - some of the training data is labeled and the rest isn’t, learn patterns for the unlabeled data using the labeled data
Reinforcement learning - an agent learns a strategy by trial and error, guided by rewards and penalties
Machine learning
Supervised model - identifying patterns that separate labeled, categorical data
Machine learning
Unsupervised model - identify structure without using any pre-specified labels
Machine learning algorithms
Which machine learning method should I use?
Some considerations:
1. What question am I asking?
Do I want to predict categories or values from labeled data? Classify two categories or multiple categories? Or do I want to learn structure in unlabeled data?
2. Are there constraints on speed, memory and computing needed to create the model?
Models such as linear regression and Naive Bayes are fast, whereas neural nets can often be quite time consuming and may also need specialized computing (e.g. GPUs)
3. How much training data do I have?
Certain machine learning models work better with large amounts of training data - e.g. neural networks - whereas others can work well even with very little training data - e.g. naive Bayes
4. How interpretable do I want the resulting model to be?
Do I want to know the underlying features that contributed to a prediction - or is that not important? E.g. linear regression and naive Bayes provide interpretable features, whereas neural networks generally do not
Machine learning algorithms
Types of machine learning algorithms:
Supervised:
Linear regression, logistic regression, LASSO/elastic net, support vector machines (SVM), naive Bayes, random forests/gradient boosting, k-nearest neighbors (KNN), neural networks (e.g. CNNs)
Unsupervised:
Clustering (K-means, Hierarchical), dimension reduction (PCA, tSNE, SVD)
Semi-supervised:
Label spreading, self-learning, semi-supervised support vector machine (S3VM), graph-based, co-training
Many are implemented in R - though if you end up doing a lot of machine learning, you would likely move to Python
Machine learning algorithms
Machine learning - dimension reduction
Dimension reduction is learning a lower dimension representation of high dimensional data that retains some of the most prominent patterns in the data
Principal components analysis (PCA)
Eigenvectors are derived from the covariance matrix of the data (the pairwise covariances between all measurements)
The eigenvectors (PCs) are then sorted by explained variance - the top PCs are the ones explaining most of the variance in the data
Others: SVD, LSI/LSA
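As a minimal sketch of PCA, base R's prcomp() can be run directly on the iris measurements (used here purely for illustration):
pca <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE)  ## PCs from the scaled (correlation) matrix
summary(pca)        ## proportion of variance explained by each PC
head(pca$x[,1:2])   ## coordinates of each sample on the top two PCs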
Machine learning - dimension reduction
Dimension reduction is learning a lower dimension representation of high dimensional data that retains some of the most prominent patterns in the data
PCA has a couple of limitations - including that it assumes linear relationships among the variables
Non-linear dimension reduction techniques include things you’ve probably also heard of - tSNE, UMAP etc.
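A minimal t-SNE sketch, assuming the Rtsne package is installed (parameter values are illustrative only):
library(Rtsne)
X <- as.matrix(unique(iris[,1:4]))   ## Rtsne requires no duplicate rows
set.seed(1)                          ## t-SNE is stochastic
ts <- Rtsne(X, dims=2, perplexity=30)
plot(ts$Y, xlab="tSNE1", ylab="tSNE2")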
Machine learning algorithms
Some commonly used ML algorithms for classification
Decision trees/Random forest
Decision trees are simple, interpretable representations of classifiers. Random forests work by creating many different decision trees and returning the class (or value) that receives the majority vote across the individual trees
Decision tree:
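A minimal random forest sketch, assuming the randomForest package is installed:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data=iris, ntree=500)  ## 500 trees, majority vote
rf                        ## out-of-bag error and per-class confusion matrix
predict(rf, iris[1:3,])   ## majority-vote predictions for new rows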
Machine learning algorithms
Some commonly used ML algorithms for classification
Decision trees/Random forest
Decision trees are simple, interpretable representations of classifiers. Random forests work by creating many different decision trees and returning the class (or value) that receives the majority vote across the individual trees
Strengths: fast to train, easy to interpret (individual trees), handle mixed feature types with little preprocessing
Weaknesses: single trees easily overfit; random forests are harder to interpret and slower to train
Machine learning algorithms
Some commonly used ML algorithms for classification
Support vector machine
Used for classifying two categories by identifying a hyperplane that maximizes the margin between the closest samples (the support vectors) of each class
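A minimal two-class SVM sketch, assuming the e1071 package is installed:
library(e1071)
iris2 <- droplevels(subset(iris, Species!="setosa"))   ## two classes only
fit <- svm(Species ~ ., data=iris2, kernel="linear", cost=1)
fit$index                                  ## rows of iris2 acting as support vectors
table(predict(fit, iris2), iris2$Species)  ## predicted vs actual classes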
Machine learning algorithms
Some commonly used ML algorithms for classification
Naive Bayes classifier
Uses Bayes' theorem to determine the probability that each feature belongs to each class based on training data, and then classifies new data based on these probabilities - the 'naive' part comes from assuming that the features are independent. Simple but surprisingly effective (and interpretable)
Test accuracy=.95, COVID rate=.05
What is the probability I have COVID if I have a positive test?
Machine learning algorithms
Some commonly used ML algorithms for classification
Naive Bayes classifier
Uses Bayes' theorem to determine the probability that each feature belongs to each class based on training data, and then classifies new data based on these probabilities - the 'naive' part comes from assuming that the features are independent. Simple but surprisingly effective (and interpretable)
P(COVID | positive test) = (0.95 * 0.05) / (0.95 * 0.05 + 0.05 * 0.95) = 0.50
Test accuracy=.95, COVID rate=.05
What is the probability I have COVID if I have a positive test?
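The same arithmetic in R (assuming the 0.95 'accuracy' is both the sensitivity and the specificity of the test):
sens <- 0.95   ## P(positive | COVID)
prev <- 0.05   ## P(COVID)
fpr  <- 0.05   ## P(positive | no COVID) = 1 - specificity
(sens * prev) / (sens * prev + fpr * (1 - prev))   ## 0.5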
Machine learning algorithms
Considerations - pre-processing data
Scaling data: Some machine learning algorithms perform better when the data is first scaled and centered prior to input to model training
Feature selection: Some machine learning algorithms can make use of all available features in building models. Other algorithms benefit from sub-selecting features before building the model - e.g. remove uninformative features, or to prune out redundant features
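A minimal pre-processing sketch with base R and caret (the cutoff value is illustrative):
library(caret)
X <- as.matrix(iris[,1:4])
X_scaled <- scale(X)                      ## center and scale each feature
nzv <- nearZeroVar(X_scaled)              ## indices of near-constant (uninformative) features
high_cor <- findCorrelation(cor(X_scaled), cutoff=0.9)  ## indices of highly redundant features
X_pruned <- X_scaled[, setdiff(seq_len(ncol(X_scaled)), c(nzv, high_cor))]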
Machine learning algorithms
Considerations - hyperparameters
Hyperparameters are settings of the algorithm that affect how the model learns from data - they can be tuned as part of training to improve model performance
Distinct from the model's parameters, which are learned from the data during training rather than set by the user
Different algorithms have different hyperparameters which can be modified in training
For example - the C parameter in an SVM determines how strongly misclassification of training data is penalized
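A minimal sketch of tuning C with caret's tuneGrid (the grid values are illustrative, not recommendations; assumes caret and its kernlab backend are installed):
library(caret)
iris2 <- droplevels(subset(iris, Species!="setosa"))
grid <- expand.grid(sigma=0.5, C=c(0.25, 1, 4, 16))   ## candidate hyperparameter values
svm_tune <- train(Species ~ ., data=iris2, method="svmRadial",
                  trControl=trainControl(method="cv", number=5),
                  tuneGrid=grid)
svm_tune$bestTune   ## the hyperparameters chosen by cross-validation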
Training supervised models
How do we evaluate performance of a model with labeled outcomes?
When predicting continuous values, e.g. in regression, we can use measures such as the root mean squared error (RMSE), which we covered previously, or the correlation between actual and predicted values
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Training supervised models
How do we evaluate performance of a supervised model?
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Accuracy = (TP + TN) / All
What percentage of all predictions were correct?
Sensitivity/recall = TP / (TP + FN)
What percentage of actual positives were correctly predicted?
Specificity = TN / (TN + FP)
What percentage of actual negatives were correctly predicted?
Precision = TP / (TP + FP)
What percentage of predicted positives were correct?
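Computing these metrics in R from hypothetical confusion-matrix counts:
TP <- 40; FP <- 10; FN <- 5; TN <- 45   ## made-up counts for illustration
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)   ## recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
c(accuracy=accuracy, sensitivity=sensitivity, specificity=specificity, precision=precision)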
Training supervised models
How do we evaluate performance of a supervised model?
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Sensitivity/recall = TP / (TP + FN)
What percentage of actual positives were correctly predicted?
Precision = TP / (TP + FP)
What percentage of predicted positives were correct?
One measure to summarize both precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
Receiver operating characteristic (ROC)
Visual representation of a model's performance, comparing the true positive rate (sensitivity) and the false positive rate (1 - specificity)
Receiver operating characteristic (ROC)
When classifying data - the threshold used to discriminate positive or negative values could vary. A ROC curve shows changes in TPR/FPR as a function of this discrimination threshold
At one extreme discrimination threshold all values are predicted as negative so the TPR and FPR are both 0, and at the other extreme threshold TPR/FPR are both 1
But in between, does the TPR rise at a faster rate than the FPR? And at what threshold is this difference greatest?
Receiver operating characteristic (ROC)
When classifying data - the threshold used to discriminate positive or negative values could vary. A ROC curve shows changes in TPR/FPR as a function of this discrimination threshold
Given a ROC curve it is possible to then calculate the ‘area’ under the curve (AUC)
A perfect classifier has an AUC of 1 - a random classifier has an AUC of .5
We therefore hope for an AUC greater than .5 (random) - the higher the AUC, the better the classifier
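A minimal ROC/AUC sketch, assuming the pROC package is installed (labels and scores are made up for illustration):
library(pROC)
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)                   ## hypothetical true classes
scores <- c(0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4)   ## hypothetical predicted probabilities
roc_obj <- roc(labels, scores)
plot(roc_obj)    ## TPR vs FPR across all discrimination thresholds
auc(roc_obj)     ## area under the curve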
Training supervised models
We can train and evaluate a model using labeled data... but how do we know our model is any good at actually predicting things in the real world?
One issue in machine learning is ‘overfitting’ - what is overfitting?
Training supervised models
We can train and evaluate a model using labeled data... but how do we know our model is any good at actually predicting things in the real world?
One issue in machine learning is ‘overfitting’ - meaning that when models are applied to independent data they have much lower predictive power than in training. In other words - the models are really good at predicting training data, but then don’t have much real world application
We would ideally like to evaluate the performance of our model on independent labeled data to give a truer sense of performance - however, depending on our question, it might not always be possible to have access to additional data with labels
What can we do?
Cross-validation
When training a model - we can ‘leave-out’ or randomly set aside some of the training data to save as test data
Say we use 80% of the data as a training set and 20% as a test set - then we can make predictions using the test set as ‘independent’ data not used in building the model.
However, this might produce different results depending on the split. Ideally we would still be able to use all of the training data as part of this procedure
Cross-validation separates the training data into multiple ‘splits’ or ‘folds’ - and then performs training/testing across the different splits
3-fold cross-validation:
Cross-validation
Cross-validation separates the training data into multiple ‘splits’ or ‘folds’ - and then performs training/testing across the different splits
3-fold cross-validation:
Based on the best parameters found across the cross-validation folds, a final model is then fit to all of the training data using these parameters
Typically an independent test set is also used to evaluate model performance
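A minimal 3-fold cross-validation sketch, assuming the caret and class packages are installed (KNN is used here purely as a simple example model):
library(caret)   ## for createFolds
library(class)   ## for knn
set.seed(1)
folds <- createFolds(iris$Species, k=3)   ## three held-out index sets
accs <- sapply(folds, function(test_idx) {
  pred <- knn(train=iris[-test_idx, 1:4], test=iris[test_idx, 1:4],
              cl=iris$Species[-test_idx], k=5)
  mean(pred == iris$Species[test_idx])    ## accuracy on the held-out fold
})
accs         ## per-fold accuracy
mean(accs)   ## cross-validated estimate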
Machine learning in R
There are numerous tools available to perform supervised machine learning in R - one of the most popular packages for classification and regression is ‘caret’
install.packages("caret")
library(caret)
?train ## training using supplied data and model type
?predict ## predict values using trained model
Machine learning in R
Let’s take an example dataset ‘iris’ - which has measurements from multiple species of iris, and we will build a machine learning classifier of species label using the measurements
?iris
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Machine learning in R
Let's take an example dataset 'iris' - which has measurements from multiple species of iris, and we will build a machine learning classifier of species label using the measurements
library(ggplot2)
ggplot(iris,aes(x=Petal.Length,y=Petal.Width,color=Species)) + geom_point()
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species)) + geom_point()
Machine learning in R
Let’s start by building a classifier of two iris species - virginica and versicolor
library(dplyr)
iris_filter <- filter(iris,Species!="setosa")
iris_filter$Species <- factor(iris_filter$Species) ## drop the unused 'setosa' level
summary(iris_filter)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.000 Min. :1.000 versicolor:50
1st Qu.:5.800 1st Qu.:2.700 1st Qu.:4.375 1st Qu.:1.300 virginica :50
Median :6.300 Median :2.900 Median :4.900 Median :1.600
Mean :6.262 Mean :2.872 Mean :4.906 Mean :1.676
3rd Qu.:6.700 3rd Qu.:3.025 3rd Qu.:5.525 3rd Qu.:2.000
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Machine learning in R
First we will specify how we want to control the training - e.g. by cross-validation etc., and how we want to evaluate performance
Here we will use 10-fold cross-validation, repeated 10 times, computing class probabilities and ROC-based performance summaries:
fitControl <- trainControl(
method = "repeatedcv", ## CV
number = 10, ## 10-fold
repeats = 10, ## repeated 10 times
classProbs=T,
summaryFunction = twoClassSummary)
Machine learning in R
Next we will train a model using the data and the specified 10-fold cross validation using a Support vector machine (SVM)
svm1 <- train(Species ~ ., data=iris_filter, method="svmRadial",
              verbose=T, trControl=fitControl, metric="ROC",
              preProc=c("center","scale"), tuneLength=8)
Support Vector Machines with Radial Basis Function Kernel
100 samples
4 predictor
2 classes: 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results across tuning parameters:
C ROC Sens Spec
0.25 0.9940 0.934 0.940
0.50 0.9936 0.938 0.938
1.00 0.9920 0.928 0.944
2.00 0.9916 0.928 0.946
4.00 0.9888 0.934 0.932
8.00 0.9812 0.924 0.946
16.00 0.9728 0.934 0.918
32.00 0.9620 0.918 0.922
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.3503754 and C = 0.25.
Machine learning in R
How would the results of using a different model compare to SVM? What about a naive Bayes model? The SVM performance is a bit better
nb1 <- train(Species ~ ., data=iris_filter, method="naive_bayes",
             verbose=T, trControl=fitControl, metric="ROC",
             preProc=c("center","scale"), tuneLength=8)
Naive Bayes
100 samples
4 predictor
2 classes: 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results across tuning parameters:
usekernel ROC Sens Spec
FALSE 0.9872 0.942 0.924
TRUE 0.9788 0.936 0.940
Tuning parameter 'laplace' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE and adjust = 1.
Machine learning in R
What if we wanted to know what features had the biggest impact on the model?
Petal length and width appear to be the biggest driving factors in the model; conversely, sepal width has little to no importance - visually this seems accurate
varImp(svm1)
ROC curve variable importance
Importance
Petal.Length 100.00
Petal.Width 99.44
Sepal.Length 39.55
Sepal.Width 0.00
Machine learning in R
Next we obtain measurements from some new iris plants and want to classify their species as versicolor or virginica
new_iris <- data.frame(Sepal.Length=c(5,6.5,7.5,5),Sepal.Width=c(2.5,3.0,2.8,4.0),Petal.Length=c(4,5,6,2),Petal.Width=c(1,1.75,2,0.5))
predict(svm1,new_iris)
[1] versicolor virginica virginica virginica
Levels: versicolor virginica
predict(svm1,new_iris,type="prob")
versicolor virginica
1 0.98232257 0.01767743
2 0.31184177 0.68815823
3 0.01620222 0.98379778
4 0.38584620 0.61415380
In-class examples
On Friday we will go through:
- Example of dimension reduction using PCA
- Example of training a machine learning classifier on biomedical data
- Evaluating accuracy of the trained classifier
Summary
Machine learning identifies generalizable patterns in data that can be used to make predictions
There are many different machine learning algorithms - each with strengths, weaknesses, and numerous flavors - and the appropriate algorithm depends on your question, data, and desired outputs
Input data used to build a supervised model is typically split into training and test sets; the test data can be used to estimate model performance (e.g. precision and recall), and hyperparameters can be optimized during training
This can all be done fairly easily in R - though if you did a lot of machine learning you would more likely end up using Python