(XKCD, Randall Munroe)
(https://imgs.xkcd.com/comics/machine_learning.png)
Machine Learning in the Thomson Lab
David Merrell
Thomson Lab meeting 2019-03-29
Motivation
Science is a contest of hypotheses.
Machine Learning (ML) can be useful at any point in the lifespan of a hypothesis.
I’ll describe some of the ML work I’ve been doing with Ron Stewart.
(Taken *without* permission from Li-Fang Chu)
The Scientific Method
Machine Learning can be injected anywhere in this process!
Generally speaking, ML can augment any process that involves:
KinderMiner
Statistical Hypothesis Testing
Data processing
Clustering
PCA
Automation
Statistical Design of Experiments
Active Learning
EHR mining
Using ML for prediction: Supervised Learning
Train the predictor:
Test the predictor; measure its performance:
Supervised Learning for Drug Repurposing
(Aliper et al., 2016)
Use a Neural Network to predict drugs’ therapeutic uses...
When the neural network predicts the wrong therapeutic use, maybe that’s actually a drug repurposing opportunity.
(Figure: Predicted Therapeutic Use vs. Known Therapeutic Use)
(Aside: Neural Networks)
Linear Regression:
Logistic Regression:
“Deep” Neural Network:
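The progression on this slide can be written out directly: logistic regression is a linear model passed through a sigmoid, and a “deep” network stacks several such layers. A minimal forward pass with hand-picked weights (purely illustrative, not from the talk):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def linear(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [1.0, 2.0]

# Linear regression: y = w·x + b
y_lin = linear(x, [0.5, -0.25], 0.1)

# Logistic regression: squash the linear output into (0, 1)
y_log = sigmoid(y_lin)

# "Deep" neural network: stack layers, with a nonlinearity between them.
# Here: two hidden units, then one output unit.
h1 = [sigmoid(linear(x, [0.5, -0.25], 0.1)),
      sigmoid(linear(x, [-1.0, 1.0], 0.0))]
y_deep = sigmoid(linear(h1, [1.0, 1.0], -1.0))

print(y_lin, y_log, y_deep)
```

Training a real network means adjusting all of those weights to fit labeled data (via gradient descent); the forward pass itself is just this composition of linear maps and nonlinearities.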
[Aliper et al., 2016] Details: Dataset
Drug-perturbed gene expression profiles (microarray):
1,319,138 profiles
976 + 11,350 = 12,797 genes
51,383 perturbagens
76 cell lines
[Aliper et al., 2016] Details: Preprocessing
1,319,138 x 12,797
→ Restrict to A549, MCF7, PC3 cell lines; 678 drugs
26,420 x 976
→ OncoFinder: gene expressions → pathway activations
→ Discard “insignificantly perturbed” profiles (p > 0.05)
9,352 x 976 (genes) and 9,352 x 271 (pathways)
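The filtering steps of this pipeline can be sketched in plain Python (the field names and profile records here are hypothetical; this is a minimal sketch, not the authors' code):

```python
# Hypothetical expression-profile records with the fields the pipeline needs.
profiles = [
    {"cell_line": "A549",  "drug": "metformin", "p_value": 0.01},
    {"cell_line": "HEPG2", "drug": "metformin", "p_value": 0.01},
    {"cell_line": "MCF7",  "drug": "aspirin",   "p_value": 0.20},
]

KEPT_CELL_LINES = {"A549", "MCF7", "PC3"}

# Step 1: restrict to the three cell lines of interest.
restricted = [p for p in profiles if p["cell_line"] in KEPT_CELL_LINES]

# Step 2: discard "insignificantly perturbed" profiles (p > 0.05).
significant = [p for p in restricted if p["p_value"] <= 0.05]

print(len(restricted), len(significant))  # → 2 1
```

The pathway-activation step (OncoFinder) is a separate transformation of each surviving profile, mapping 976 gene values to 271 pathway scores.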
[Aliper et al., 2016] Details: Machine Learning
Data: 9,352 x 976 (genes) and 9,352 x 271 (pathways)
Learning Systems: “Deep” Neural Networks; Support Vector Machines (baseline)
Cross Validation Testing Framework
(Aside: Support Vector Machines)
Very simple idea for a predictor:
Find a line which separates the classes.
Classify new points by the side of the line they land on.
There are tricks for making very powerful classifiers based on this concept.
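The “side of the line” rule is just the sign of a linear function. A minimal sketch with a hand-picked separating line (an SVM would *learn* the maximum-margin line from labeled data; the weights here are my own illustration):

```python
# Classify 2-D points by which side of the line w·x + b = 0 they fall on.
w = (1.0, -1.0)   # normal vector of the separating line (hand-picked)
b = 0.0

def classify(x):
    # score = x[0] - x[1]; its sign says which side of the line x is on
    score = w[0] * x[0] + w[1] * x[1] + b
    return +1 if score >= 0 else -1

print(classify((2.0, 1.0)))   # → 1
print(classify((1.0, 3.0)))   # → -1
```

The “tricks” mentioned above include the maximum-margin objective and kernels, which let the same sign-of-a-function rule carve out nonlinear boundaries.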
[Aliper et al., 2016] Details: Results
Drug repurposing opportunities???
(That’s all they mention in the paper)
[Aliper et al., 2016] Replication
[Aliper et al., 2016] Replication: Preprocessing
1,319,138 x 12,797
→ Restrict to A549, MCF7, PC3 cell lines; 678 drugs
26,420 x 976
→ OncoFinder (gene expressions → pathway activations): PROPRIETARY!
[Aliper et al., 2016] Replication: Machine Learning
Data: 26,420 x 976
Learning Systems: “Deep” Neural Networks; Support Vector Machines; Naive Bayes; Random Forests
(correct) Cross Validation Testing Framework
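A cross-validation testing framework partitions the data into folds and rotates which fold is held out, so every sample is tested exactly once. A minimal sketch of k-fold index splitting (plain Python, my own illustration rather than the replication's pipeline):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        if fold == k - 1:
            stop = n_samples  # the last fold absorbs any remainder
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

splits = list(k_fold_splits(10, 3))
# Every sample appears in exactly one test fold:
all_test = sorted(i for _, test in splits for i in test)
print(all_test)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Doing this *correctly* also means the splits must respect the structure of the data, e.g. keeping all profiles of a given drug on the same side of the train/test boundary so nothing about the test drugs leaks into training.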
(Aside: Decision Trees & Random Forests)
(Figure: a very practical decision tree)
Decision Trees: Start at the top and answer questions until you reach the bottom.
There are algorithms to build these trees from labeled data.
Random Forests: Build many decision trees, but inject some randomness into them. Combine the trees’ decisions via plurality vote.
This collection of “cognitively diverse” decision trees can make better decisions than any individual tree!
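The plurality vote is the whole trick. A minimal sketch, with each “tree” stubbed as a simple hand-written rule (a real random forest would learn many randomized trees from labeled data; the rules and feature names here are my own illustration):

```python
from collections import Counter

# Each "tree" maps a feature dict to a class label.
trees = [
    lambda x: "toxic" if x["dose"] > 10 else "safe",
    lambda x: "toxic" if x["dose"] > 5 and x["weight"] < 60 else "safe",
    lambda x: "safe",   # a deliberately bad tree, outvoted by the others
]

def forest_predict(x):
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]   # plurality vote

print(forest_predict({"dose": 12, "weight": 50}))  # prints "toxic"
```

Note that the third tree is always wrong for toxic inputs, yet the ensemble still answers correctly: errors of individual trees wash out as long as they are not all wrong in the same way.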
[Aliper et al., 2016] Replication: Results
Most drugs were mislabeled -- a full spreadsheet is available on request.
Given the low quality of prediction, it’s hard to say how useful they would be...
[Aliper et al., 2016] Replication: Lessons Learned
Current & Future Work: Unsupervised Learning
Unsupervised Learning: Finding Patterns in Data
Classic unsupervised learning tasks:
These tasks (and many others) can be formulated using Bayesian Statistics.
Exciting New Bayesian Tools!
Probabilistic Programming
A convenient way to write down statistical models and perform inference.
→ Therefore, a convenient way to write down testable hypotheses.
Bayesian Hypothesis Testing
Classical frequentist hypothesis test: “Do we reject the null hypothesis?” (p-values, significance levels)
vs.
Bayesian hypothesis test: “Which hypothesis is more probable?” (Goodbye, significance. Hello, Bayes factors!)
Statistical methodologies beyond p-values and significance levels...
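A Bayes factor is the ratio of how well two hypotheses predict the observed data. A minimal coin-flip sketch (my own illustration, not from the talk): compare H1 “the coin is fair” against H2 “heads with probability 0.8”, given 8 heads in 10 flips.

```python
from math import comb

def binomial_likelihood(k, n, p):
    """P(k heads in n flips | heads-probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 8, 10
lik_fair = binomial_likelihood(k, n, 0.5)    # H1: fair coin
lik_biased = binomial_likelihood(k, n, 0.8)  # H2: p = 0.8

# Bayes factor: how strongly the data favor H2 over H1
bayes_factor = lik_biased / lik_fair
print(round(bayes_factor, 2))  # → 6.87
```

Instead of a reject/fail-to-reject verdict at an arbitrary significance level, the Bayes factor reports graded evidence: here the data are about 7 times more probable under the biased-coin hypothesis.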
Thank You
In particular:
Ron Stewart
Finn Kuusisto
David Page (BMI Dept)
The Bioinformatics Team
Questions?
EXTRA SLIDES
The Scientific Method & Artificial Intelligence & Machine Learning
Supervised Learning (Regression and Classification)
Unsupervised Learning (finding patterns in data)
Reinforcement Learning / Active Learning (autonomous control)