A brief intro to machine learning
BIOM285
May 21, 2025
Machine learning
What is machine learning? Building models that learn inherent properties of data
What are the models ‘learning’? Patterns that are generalizable in order to gain insight into the data and/or make predictions about new data
Machine learning is not the same as statistics - statistics performs inference about a population from a sample, whereas machine learning learns patterns in order to make predictions. Machine learning is a type of artificial intelligence
Machine learning algorithms
Many different options for machine learning
Machine learning
In machine learning you build a model from some input (i.e. training) data, and then use that model to predict the outcome of new data
Some types of machine learning questions:
Classification - predict which of several pre-defined categories a data point belongs to
Regression - predict a specific quantity for a data point
Clustering - assign data to categories defined from the data itself
Dimension reduction - determine the high-level features that summarize the data
Machine learning
In machine learning you build a model from input (i.e. training) data, and can then use that model to predict the outcome of new data
Some types of machine learning:
Supervised - pre-specify the outcomes (labels) of the training data, and then learn patterns that separate the data - can be categorical (classification) or continuous (regression)
Unsupervised - no pre-specification of outcomes for the training data, learn patterns that describe data as groups (e.g. clusters) or quantities (e.g. dimension reduction)
Semi-supervised - some of the training data is labeled and the rest isn’t, learn patterns for the unlabeled data using the labeled data
Reinforcement learning - an agent learns a strategy by trial and error, guided by rewards and penalties
Machine learning
Supervised model - identifying patterns that separate labeled, categorical data
Machine learning
Unsupervised model - identify structure without using any pre-specified labels
Machine learning algorithms
Which machine learning method should I use?
Some considerations:
1. What question am I asking?
Do I want to predict categories or values from labeled data? Classify two categories or multiple categories? Or do I want to learn structure in unlabeled data?
2. Are there constraints on speed, memory and computing needed to create the model?
Models such as linear regression and Naive Bayes are fast, whereas neural nets can often be quite time consuming and may also need specialized computing (e.g. GPUs)
3. How much training data do I have?
Certain machine learning models work better with large amounts of training data - e.g. neural networks - whereas others can work well even with very little training data - e.g. naive Bayes
4. How interpretable do I want the resulting model to be?
Do I want to know the underlying features that contributed to a prediction - or is that not important? E.g. linear regression and naive Bayes provide interpretable features, whereas neural networks generally do not
Machine learning algorithms
Types of machine learning algorithms:
Supervised:
Linear regression, logistic regression, LASSO/elastic net, support vector machines (SVM), naive Bayes, random forests/gradient boosting, k-nearest neighbors (KNN), neural networks (e.g. CNNs)
Unsupervised:
Clustering (K-means, Hierarchical), dimension reduction (PCA, tSNE, SVD)
Semi-supervised:
Label spreading, self-learning, semi-supervised support vector machine (S3VM), graph-based, co-training
Many are implemented in R - though if you end up doing a lot of machine learning, you would likely move to Python
Machine learning algorithms
Machine learning - dimension reduction
Dimension reduction is learning a lower dimension representation of high dimensional data that retains some of the most prominent patterns in the data
Principal components analysis (PCA)
Eigenvectors are derived from the covariance matrix of the data (the pairwise covariances between all measurements)
The eigenvectors (PCs) are then sorted by explained variance - the top PCs are the ones explaining most of the variance in the data
Others: SVD, LSI/LSA
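As a minimal sketch of PCA, base R's prcomp() can be run directly on the iris measurements (used here purely for illustration):
pca <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE)  ## PCs from the scaled (correlation) matrix
summary(pca)        ## proportion of variance explained by each PC
head(pca$x[,1:2])   ## coordinates of each sample on the top two PCs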
Machine learning - dimension reduction
Dimension reduction is learning a lower dimension representation of high dimensional data that retains some of the most prominent patterns in the data
PCA has a couple of limitations - including that it assumes linear relationships among the variables
Non-linear dimension reduction techniques include things you’ve probably also heard of - tSNE, UMAP etc.
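A minimal t-SNE sketch, assuming the Rtsne package is installed (parameter values are illustrative only):
library(Rtsne)
X <- as.matrix(unique(iris[,1:4]))   ## Rtsne requires no duplicate rows
set.seed(1)                          ## t-SNE is stochastic
ts <- Rtsne(X, dims=2, perplexity=30)
plot(ts$Y, xlab="tSNE1", ylab="tSNE2")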
Machine learning algorithms
Some commonly used ML algorithms for classification
Decision trees/Random forest
Decision trees are simple, interpretable representations of classifiers. Random forests work by creating many different decision trees and returning the class (or value) that receives the majority vote across the individual trees
Decision tree:
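A minimal random forest sketch, assuming the randomForest package is installed:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data=iris, ntree=500)  ## 500 trees, majority vote
rf                        ## out-of-bag error and per-class confusion matrix
predict(rf, iris[1:3,])   ## majority-vote predictions for new rows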
Machine learning algorithms
Some commonly used ML algorithms for classification
Decision trees/Random forest
Decision trees are simple, interpretable representations of classifiers. Random forests work by creating many different decision trees and returning the class (or value) that receives the majority vote across the individual trees
Strengths: fast to train, easy to interpret (individual trees), handle mixed feature types with little preprocessing
Weaknesses: single trees easily overfit; random forests are harder to interpret and slower to train
Machine learning algorithms
Some commonly used ML algorithms for classification
Support vector machine
Used for classifying two categories by identifying a hyperplane that maximizes the margin between the closest samples (the support vectors) of each class
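A minimal two-class SVM sketch, assuming the e1071 package is installed:
library(e1071)
iris2 <- droplevels(subset(iris, Species!="setosa"))   ## two classes only
fit <- svm(Species ~ ., data=iris2, kernel="linear", cost=1)
fit$index                                  ## rows of iris2 acting as support vectors
table(predict(fit, iris2), iris2$Species)  ## predicted vs actual classes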
Machine learning algorithms
Some commonly used ML algorithms for classification
Naive Bayes classifier
Uses Bayes' theorem to determine the probability that each feature belongs to each class based on training data, and then classifies new data based on these probabilities - the 'naive' part comes from assuming that the features are independent. Simple but surprisingly effective (and interpretable)
Test accuracy=.95, COVID rate=.05
What is the probability I have COVID if I have a positive test?
Machine learning algorithms
Some commonly used ML algorithms for classification
Naive Bayes classifier
Uses Bayes' theorem to determine the probability that each feature belongs to each class based on training data, and then classifies new data based on these probabilities - the 'naive' part comes from assuming that the features are independent. Simple but surprisingly effective (and interpretable)
P(COVID | positive test) = (0.95 * 0.05) / (0.95 * 0.05 + 0.05 * 0.95) = 0.50
Test accuracy=.95, COVID rate=.05
What is the probability I have COVID if I have a positive test?
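The same arithmetic in R (assuming the 0.95 'accuracy' is both the sensitivity and the specificity of the test):
sens <- 0.95   ## P(positive | COVID)
prev <- 0.05   ## P(COVID)
fpr  <- 0.05   ## P(positive | no COVID) = 1 - specificity
(sens * prev) / (sens * prev + fpr * (1 - prev))   ## 0.5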
Machine learning algorithms
Considerations - pre-processing data
Scaling data: Some machine learning algorithms perform better when the data is first scaled and centered prior to input to model training
Feature selection: Some machine learning algorithms can make use of all available features in building models. Other algorithms benefit from sub-selecting features before building the model - e.g. remove uninformative features, or to prune out redundant features
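A minimal pre-processing sketch with base R and caret (the cutoff value is illustrative):
library(caret)
X <- as.matrix(iris[,1:4])
X_scaled <- scale(X)                      ## center and scale each feature
nzv <- nearZeroVar(X_scaled)              ## indices of near-constant (uninformative) features
high_cor <- findCorrelation(cor(X_scaled), cutoff=0.9)  ## indices of highly redundant features
X_pruned <- X_scaled[, setdiff(seq_len(ncol(X_scaled)), c(nzv, high_cor))]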
Machine learning algorithms
Considerations - hyperparameters
Hyperparameters are settings of the algorithm that affect how the model learns from data - they can be tuned as part of training to improve model performance
Distinct from the model's parameters, which are learned from the data during training rather than set by the user
Different algorithms have different hyperparameters which can be modified in training
For example - the C parameter in an SVM determines how strongly misclassification of training data is penalized
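A minimal sketch of tuning C with caret's tuneGrid (the grid values are illustrative, not recommendations; assumes caret and its kernlab backend are installed):
library(caret)
iris2 <- droplevels(subset(iris, Species!="setosa"))
grid <- expand.grid(sigma=0.5, C=c(0.25, 1, 4, 16))   ## candidate hyperparameter values
svm_tune <- train(Species ~ ., data=iris2, method="svmRadial",
                  trControl=trainControl(method="cv", number=5),
                  tuneGrid=grid)
svm_tune$bestTune   ## the hyperparameters chosen by cross-validation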
Training supervised models
How do we evaluate performance of a model with labeled outcomes?
When predicting continuous values, e.g. in regression, we can use measures such as the root mean squared error (RMSE), which we covered previously, or the correlation between actual and predicted values
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Training supervised models
How do we evaluate performance of a supervised model?
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Accuracy = (TP + TN) / All
What percentage of all predictions were correct?
Sensitivity/recall = TP / (TP + FN)
What percentage of actual positives were correctly predicted?
Specificity = TN / (TN + FP)
What percentage of actual negatives were correctly predicted?
Precision = TP / (TP + FP)
What percentage of predicted positives were correct?
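Computing these metrics in R from hypothetical confusion-matrix counts:
TP <- 40; FP <- 10; FN <- 5; TN <- 45   ## made-up counts for illustration
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)   ## recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
c(accuracy=accuracy, sensitivity=sensitivity, specificity=specificity, precision=precision)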
Training supervised models
How do we evaluate performance of a supervised model?
For a classification problem - we can compare predicted labels to the ‘true’ labels to determine whether the model predictions are accurate
Confusion matrix:
                  | Actual cat          | Not actual cat
Predicted cat     | True positive (TP)  | False positive (FP)
Not predicted cat | False negative (FN) | True negative (TN)
Sensitivity/recall = TP / (TP + FN)
What percentage of actual positives were correctly predicted?
Precision = TP / (TP + FP)
What percentage of predicted positives were correct?
One measure to summarize both precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
Receiver operating characteristic (ROC)
Visual representation of a model's performance, comparing the true positive rate (sensitivity) and the false positive rate (1 - specificity)
Receiver operating characteristic (ROC)
When classifying data - the threshold used to discriminate positive or negative values could vary. A ROC curve shows changes in TPR/FPR as a function of this discrimination threshold
At one extreme discrimination threshold all values are predicted as negative so the TPR and FPR are both 0, and at the other extreme threshold TPR/FPR are both 1
But in between, does the TPR rise at a faster rate than the FPR? And at what threshold is this difference greatest?
Receiver operating characteristic (ROC)
When classifying data - the threshold used to discriminate positive or negative values could vary. A ROC curve shows changes in TPR/FPR as a function of this discrimination threshold
Given a ROC curve it is possible to then calculate the ‘area’ under the curve (AUC)
A perfect classifier has an AUC of 1 - a random classifier has an AUC of .5
We therefore hope for an AUC greater than .5 (random) - the higher the AUC, the better the classifier
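A minimal ROC/AUC sketch, assuming the pROC package is installed (labels and scores are made up for illustration):
library(pROC)
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)                   ## hypothetical true classes
scores <- c(0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4)   ## hypothetical predicted probabilities
roc_obj <- roc(labels, scores)
plot(roc_obj)    ## TPR vs FPR across all discrimination thresholds
auc(roc_obj)     ## area under the curve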
Training supervised models
We can train and evaluate a model using labeled data... but how do we know our model is any good at actually predicting things in the real world?
One issue in machine learning is ‘overfitting’ - what is overfitting?
Training supervised models
We can train and evaluate a model using labeled data... but how do we know our model is any good at actually predicting things in the real world?
One issue in machine learning is ‘overfitting’ - meaning that when models are applied to independent data they have much lower predictive power than in training. In other words - the models are really good at predicting training data, but then don’t have much real world application
We would ideally like to evaluate the performance of our model on independent labeled data to give a truer sense of performance - however, depending on our question, it might not always be possible to have access to additional data with labels
What can we do?
Cross-validation
When training a model - we can ‘leave-out’ or randomly set aside some of the training data to save as test data
Say we use 80% of the data as a training set and 20% as a test set - then we can make predictions using the test set as ‘independent’ data not used in building the model.
However, this might produce different results depending on the split. Ideally we would still be able to use all of the training data as part of this procedure
Cross-validation separates the training data into multiple ‘splits’ or ‘folds’ - and then performs training/testing across the different splits
3-fold cross-validation:
Cross-validation
Cross-validation separates the training data into multiple ‘splits’ or ‘folds’ - and then performs training/testing across the different splits
3-fold cross-validation:
Based on the best parameters found across the cross-validation folds, a final model is then fit to all of the training data using these parameters
Typically an independent test set is also used to evaluate model performance
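A minimal 3-fold cross-validation sketch, assuming the caret and class packages are installed (KNN is used here purely as a simple example model):
library(caret)   ## for createFolds
library(class)   ## for knn
set.seed(1)
folds <- createFolds(iris$Species, k=3)   ## three held-out index sets
accs <- sapply(folds, function(test_idx) {
  pred <- knn(train=iris[-test_idx, 1:4], test=iris[test_idx, 1:4],
              cl=iris$Species[-test_idx], k=5)
  mean(pred == iris$Species[test_idx])    ## accuracy on the held-out fold
})
accs         ## per-fold accuracy
mean(accs)   ## cross-validated estimate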
Machine learning in R
There are numerous tools available to perform supervised machine learning in R - one of the most popular packages for classification and regression is ‘caret’
install.packages("caret")
library(caret)
?train ## training using supplied data and model type
?predict ## predict values using trained model
Machine learning in R
Let’s take an example dataset ‘iris’ - which has measurements from multiple species of iris, and we will build a machine learning classifier of species label using the measurements
?iris
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Machine learning in R
Let's take an example dataset 'iris' - which has measurements from multiple species of iris, and we will build a machine learning classifier of species label using the measurements
library(ggplot2)
ggplot(iris,aes(x=Petal.Length,y=Petal.Width,color=Species)) + geom_point()
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species)) + geom_point()
Machine learning in R
Let’s start by building a classifier of two iris species - virginica and versicolor
library(dplyr)
iris_filter <- filter(iris,Species!="setosa")
iris_filter$Species <- factor(iris_filter$Species) ## drop the unused 'setosa' level
summary(iris_filter)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.000 Min. :1.000 versicolor:50
1st Qu.:5.800 1st Qu.:2.700 1st Qu.:4.375 1st Qu.:1.300 virginica :50
Median :6.300 Median :2.900 Median :4.900 Median :1.600
Mean :6.262 Mean :2.872 Mean :4.906 Mean :1.676
3rd Qu.:6.700 3rd Qu.:3.025 3rd Qu.:5.525 3rd Qu.:2.000
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Machine learning in R
First we will specify how we want to control the training - e.g. by cross-validation etc., and how we want to evaluate performance
Here we will use 10-fold cross-validation, repeated 10 times, computing class probabilities and ROC-based performance summaries:
fitControl <- trainControl(
method = "repeatedcv", ## CV
number = 10, ## 10-fold
repeats = 10, ## repeated 10 times
classProbs=T,
summaryFunction = twoClassSummary)
Machine learning in R
Next we will train a model using the data and the specified 10-fold cross validation using a Support vector machine (SVM)
svm1 <- train(Species ~ ., data=iris_filter, method="svmRadial",
              verbose=T, trControl=fitControl, metric="ROC",
              preProc=c("center","scale"), tuneLength=8)
Support Vector Machines with Radial Basis Function Kernel
100 samples
4 predictor
2 classes: 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results across tuning parameters:
C ROC Sens Spec
0.25 0.9940 0.934 0.940
0.50 0.9936 0.938 0.938
1.00 0.9920 0.928 0.944
2.00 0.9916 0.928 0.946
4.00 0.9888 0.934 0.932
8.00 0.9812 0.924 0.946
16.00 0.9728 0.934 0.918
32.00 0.9620 0.918 0.922
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.3503754 and C = 0.25.
Machine learning in R
How would the results of using a different model compare to SVM? What about a naive Bayes model? The SVM performance is a bit better
nb1 <- train(Species ~ ., data=iris_filter, method="naive_bayes",
             verbose=T, trControl=fitControl, metric="ROC",
             preProc=c("center","scale"), tuneLength=8)
Naive Bayes
100 samples
4 predictor
2 classes: 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results across tuning parameters:
usekernel ROC Sens Spec
FALSE 0.9872 0.942 0.924
TRUE 0.9788 0.936 0.940
Tuning parameter 'laplace' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE and adjust = 1.
Machine learning in R
What if we wanted to know what features had the biggest impact on the model?
Petal length and width appear to be the biggest driving factors in the model; conversely, sepal width has little to no importance - visually this seems accurate
varImp(svm1)
ROC curve variable importance
Importance
Petal.Length 100.00
Petal.Width 99.44
Sepal.Length 39.55
Sepal.Width 0.00
Machine learning in R
Next we obtain measurements from some new iris plants and want to classify their species as versicolor or virginica
new_iris <- data.frame(Sepal.Length=c(5,6.5,7.5,5),Sepal.Width=c(2.5,3.0,2.8,4.0),Petal.Length=c(4,5,6,2),Petal.Width=c(1,1.75,2,0.5))
predict(svm1,new_iris)
[1] versicolor virginica virginica virginica
Levels: versicolor virginica
predict(svm1,new_iris,type="prob")
versicolor virginica
1 0.98232257 0.01767743
2 0.31184177 0.68815823
3 0.01620222 0.98379778
4 0.38584620 0.61415380
In-class examples
On Friday we will go through:
- Example of dimension reduction using PCA
- Example of training a machine learning classifier on biomedical data
- Evaluating accuracy of the trained classifier
Summary
Machine learning identifies generalizable patterns in data that can be used to make predictions
There are many different machine learning algorithms - each with strengths, weaknesses, and numerous flavors - and the appropriate algorithm depends on your question, data, and desired outputs
Input data used to build a supervised model is typically split into training and test sets; the test data can be used to estimate model performance (e.g. precision and recall), and hyperparameters can be optimized during training
This can all be done fairly easily in R - though if you did a lot of machine learning you would more likely end up using Python