Machine Learning in Biomedicine
Sean Davis, MD, PhD
Center for Cancer Research, National Cancer Institute,
National Institutes of Health
July 9, 2017
@seandavis12
Gartner Hype Cycle
ML Tech Triggers
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell, Machine Learning (1997)
Applications
Algorithms
Supervised learning
Unsupervised learning
An early example: gene expression classifier
An early genomics example of supervised ML
Result is a program to perform predictions!
Naive Bayes
Individual features as predictors
Likelihood ratio
Feature relationships
Exploiting independence
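The Naive Bayes ideas listed above (individual features as predictors, likelihood ratios, exploiting independence) can be sketched in a few lines. This is a minimal illustration, not the classifier from the slides: it assumes binarized features, Laplace smoothing (the `alpha` parameter), and an invented toy data set.

```python
import math

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate the class prior and P(feature_j = 1 | class) for classes 0 and 1."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        prior = n / len(y)
        # Laplace smoothing (an assumption here) keeps probabilities away from 0 and 1.
        p1 = [(sum(r[j] for r in rows) + alpha) / (n + 2 * alpha)
              for j in range(len(X[0]))]
        model[c] = (prior, p1)
    return model

def log_likelihood_ratio(model, x):
    """Log prior ratio plus the sum of per-feature log likelihood ratios.

    Assuming features are independent given the class, the joint likelihood
    factors into a product over features, which is a sum in log space.
    """
    (prior0, p0), (prior1, p1) = model[0], model[1]
    llr = math.log(prior1 / prior0)
    for j, v in enumerate(x):
        l1 = p1[j] if v == 1 else 1 - p1[j]
        l0 = p0[j] if v == 1 else 1 - p0[j]
        llr += math.log(l1 / l0)
    return llr

def predict(model, x):
    """Predict class 1 when the evidence favors it, otherwise class 0."""
    return 1 if log_likelihood_ratio(model, x) > 0 else 0

# Invented toy data: rows are samples, columns are binarized expression of 3 genes.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = fit_naive_bayes(X, y)
```

Each feature contributes its own likelihood ratio, so a sample is classified by adding up independent pieces of evidence.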
Random forests
Slides from Anshul Kundaje
Functional Elements in the genome
Ecker et al. 2012
Repressed Gene
Transcription factors
(Regulatory proteins)
Enhancer
Insulator
Promoter
mRNA
Protein
Active Gene
Nucleosomes
Chromatin (histone) modifications
DNA
Motif
Dynamic Regulation of gene expression
Chromatin immunoprecipitation (ChIP-seq)
Protein-DNA binding maps
Maps of histone modifications
Maps of histone variants
Chromatin accessibility (DNase-seq, FAIRE-seq, ATAC-seq) and nucleosome sequencing (MNase-seq)
DNase-seq (ATAC-seq) ~= sum(ChIP-seq for all TFs)
ENCODE functional signal maps
Chromatin modification maps
Transcription factor binding map
H3K4me3
H3K27ac
H3K4me1
H3K36me3
H3K27me3
Relationship between chromatin marks and gene expression
Aggregation analysis and simple univariate correlation analysis suggest strong positive or negative relationships between gene expression and enrichment of chromatin marks at gene promoters
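A univariate correlation analysis of this kind can be sketched as follows. The numbers are invented for illustration only; the function computes a plain Pearson correlation between promoter mark enrichment and expression across genes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy values: one entry per gene.
expression = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]
marks = {
    "H3K4me3": [1.0, 2.0, 3.5, 6.0, 9.0, 11.0],   # mark associated with active promoters
    "H3K27me3": [9.0, 8.0, 6.0, 3.0, 2.0, 1.0],   # mark associated with repression
}
correlations = {name: pearson(signal, expression) for name, signal in marks.items()}
```

Each mark is scored on its own here, which is why this analysis cannot say how much the marks predict collectively.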
What is the collective predictive power of a set of chromatin marks? Which ones are more predictive?
Multivariate predictive model
Linear Regression model
Minimize square error to find betas
Input variables (features)
Regression coefficients, correlation, independence
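As a rough sketch of "minimize square error to find betas", the code below solves the least-squares normal equations with a small Gauss-Jordan solver. The two input features and the data are invented, and the helper names `solve` and `fit_linear` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_linear(X, y):
    """Return betas (intercept first) that minimize the sum of squared errors."""
    Xb = [[1.0] + list(row) for row in X]  # prepend an intercept column
    p = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve(XtX, Xty)

# Invented data: expression = 1 + 2 * mark_a - 0.5 * mark_b, with no noise.
X = [[1, 4], [2, 1], [3, 3], [4, 0], [5, 2]]
y = [1 + 2 * a - 0.5 * b for a, b in X]
betas = fit_linear(X, y)  # approximately [1.0, 2.0, -0.5]
```

With correlated input features the individual betas become harder to interpret, which is one motivation for the non-linear models that follow.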
A non-linear model (decision/regression tree)
H3K4me3 > 5
H3K4me1 > 10
H3K27me3 > 10
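A regression tree is just a nested set of threshold tests. The sketch below reuses the three thresholds shown above, but the order of the splits and the leaf values are assumptions made for illustration, not the tree from the slide.

```python
def predict_expression(marks):
    """Toy regression tree over chromatin mark signal at a promoter.

    Thresholds come from the slide; the tree shape and the leaf values
    (predicted expression) are hypothetical.
    """
    if marks["H3K4me3"] > 5:
        if marks["H3K27me3"] > 10:
            return 2.0   # active mark present but also a repressive mark
        return 8.0       # active mark, no repressive mark
    if marks["H3K4me1"] > 10:
        return 3.0
    return 0.5

example = predict_expression({"H3K4me3": 7, "H3K4me1": 2, "H3K27me3": 1})
```

Because each leaf depends on a combination of tests, the tree can represent interactions between marks that a linear model cannot.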
Multivariate predictive model (Random forest)
Random forests
Uses random sampling over training examples and features to construct a collection of decision trees (a forest) with significantly improved prediction performance
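The sampling idea can be sketched with a deliberately simplified forest. Two simplifications are assumptions of this sketch: each "tree" is a one-split stump, and features are sampled once per tree rather than at every split as in a full random forest. The data are invented.

```python
import random

def fit_stump(X, y, features):
    """Find the best single split among `features` by squared error.

    Returns (feature index, threshold, left mean, right mean).
    """
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: always predict the sample mean
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    """Fit stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample examples with replacement
        feats = rng.sample(range(p), n_features)     # sample a subset of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all stumps."""
    preds = [ml if row[j] <= t else mr for j, t, ml, mr in forest]
    return sum(preds) / len(preds)

# Invented data: feature 0 drives the response, feature 1 is noise.
X = [[1, 9], [2, 8], [3, 1], [8, 2], [9, 9], [10, 1]]
y = [0, 0, 0, 10, 10, 10]
forest = fit_forest(X, y)
```

Averaging many trees fit to different resamples reduces the variance of any single tree, which is where the improved performance comes from.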
Learning algorithm
RuleFit3: Friedman et al. 2005
TF1
TF2
TF3
rk
Projected co-association scores
Partial dependence of F(X) on a set (e.g., pairs) of TFs ‘g’
G = complement of g
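Partial dependence can be estimated by fixing the features in g and averaging the model output over the observed values of the complement G. The sketch below assumes an arbitrary fitted model `F`; the toy `F` and the data are invented, and the exact co-association score used in the slides is not reproduced.

```python
def partial_dependence(F, data, g, values):
    """Average F over `data` with the features indexed by `g` fixed to `values`.

    The remaining features (the complement G) keep their observed values,
    so the average marginalizes over them.
    """
    total = 0.0
    for row in data:
        x = list(row)
        for j, v in zip(g, values):
            x[j] = v
        total += F(x)
    return total / len(data)

# Hypothetical model: TF1 and TF2 interact, TF3 acts additively.
def F(x):
    return x[0] * x[1] + x[2]

# Invented binding indicators for TF1, TF2, TF3 at four sites.
data = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0]]

pd_pair = partial_dependence(F, data, g=(0, 1), values=(1, 1))   # both TFs bound
pd_tf1 = partial_dependence(F, data, g=(0,), values=(1,))        # TF1 bound only
```

One way to look for an interaction is to compare the pairwise partial dependence with what the single-TF partial dependences would give if their effects were purely additive.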
Co-association score between pairs of TFs
Deep learning example
Several slides adapted from Anshul Kundaje
Deep learning
https://www.slideshare.net/LuMa921/deep-learning-a-visual-introduction
What does deep learning learn?
Chromatin architecture cartoon
ATAC-Seq and fragment lengths
Chromatin architecture and chromatin state
Transform a 1D dataset into 2D
Transformed to an image classification problem!
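One plausible way to turn ATAC-seq fragments into a 2D "image" is to count fragments by genomic position (columns) and fragment length (rows). This encoding is an assumption made for illustration; the function name, bin sizes, and fragment coordinates are all hypothetical.

```python
def fragments_to_image(fragments, region_start, region_end,
                       pos_bin=10, len_bin=50, max_len=300):
    """Build a 2D count matrix: rows are fragment-length bins, columns are position bins.

    `fragments` is a list of (start, end) coordinates with end > start.
    Each fragment is placed at its midpoint; fragments outside the region
    or of length >= max_len are skipped.
    """
    n_cols = (region_end - region_start) // pos_bin
    n_rows = max_len // len_bin
    image = [[0] * n_cols for _ in range(n_rows)]
    for start, end in fragments:
        length = end - start
        mid = (start + end) // 2
        if not (region_start <= mid < region_end) or length >= max_len:
            continue
        col = (mid - region_start) // pos_bin
        row = length // len_bin
        if col < n_cols and row < n_rows:
            image[row][col] += 1
    return image

# Invented fragments: two short ones and one nucleosome-sized one.
fragments = [(100, 140), (105, 150), (60, 240)]
image = fragments_to_image(fragments, region_start=100, region_end=200)
```

Once the data have this shape, the same convolutional architectures used for image classification can be applied to them.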
Convolution
Convolution learns multiple low-level features
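A one-dimensional version shows the idea: each filter slides along the signal and responds to a different local pattern. The two filters here are hand-picked for illustration; in a trained network their weights would be learned.

```python
def conv1d(signal, kernel):
    """Slide `kernel` along `signal` (no padding) and take dot products.

    This is cross-correlation, which is what deep learning libraries
    usually call convolution.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hand-picked filters standing in for learned ones.
filters = {
    "edge": [-1, 1],        # responds to rises and falls
    "smooth": [0.5, 0.5],   # responds to local average signal
}

signal = [0, 0, 1, 1, 1, 0]
feature_maps = {name: conv1d(signal, kernel) for name, kernel in filters.items()}
```

Each filter yields its own feature map, so a layer with many filters detects many low-level patterns in parallel.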
Pooling, or smoothing, to gain robustness
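Max pooling is one common form of this step. The sketch below keeps only the largest value in each window, so small shifts in where a feature occurs leave the output unchanged. The window size and the example values are arbitrary choices.

```python
def max_pool1d(values, size=2, stride=2):
    """Keep the maximum of each window of `size` values, moving by `stride`."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, stride)]

# Two feature maps whose peaks are shifted by one position within each window.
original = [0, 1, 0, 0, 3, 2]
shifted = [1, 0, 0, 0, 2, 3]

pooled_original = max_pool1d(original)
pooled_shifted = max_pool1d(shifted)
```

Both inputs pool to the same values, which is the robustness the slide refers to.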
How does learning proceed?
Learn from sequence as well
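To feed DNA sequence into a convolutional network, a common approach is one-hot encoding, where each base becomes a vector of four indicators. This is a generic sketch and not necessarily the encoding used in the work shown; treating unknown bases such as N as all zeros is an assumption.

```python
def one_hot(seq):
    """Encode a DNA string as a list of [A, C, G, T] indicator vectors.

    Bases outside ACGT (for example N) become all zeros.
    """
    alphabet = "ACGT"
    return [[1 if base == a else 0 for a in alphabet] for base in seq.upper()]

encoded = one_hot("ACGN")
```

A convolutional filter applied to this matrix behaves much like a sequence motif scanner.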
Kundaje’s Chromputer
Closing thoughts
MANY, MANY more applications
https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel/blue-ribbon-panel-report-2016.pdf
Internet of Things
Potential to fundamentally change the way we interact with research subjects, patients, and the general population.