Error-controlled interaction discovery in generic machine learning models
Yang Lu
University of Waterloo
[Figure labels: data; scientific discovery]
Machine learning is revolutionizing biomedical research
[Figure: hypothesis-driven paradigm (hypothesis, data, evaluation) versus data-driven paradigm (big data, ML model, hypothesis, evaluation)]
Hypothesis generation in a data-driven paradigm
Prediction
Hypothesis generation
Generated disease-specific and testable hypotheses
Input: biomedical data
ML models
Output: disease
Biomarkers
Pathways
Interactions
Causalities
Lu et al. NeurIPS (2018)
Coming Soon!
Future work!
We are interested in non-additive interactions
Definition: Features interact non-additively if their joint effect on the output cannot be decomposed into a sum of univariate functions.
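The definition can be checked numerically. A minimal sketch, assuming a function of two features (the helper `mixed_difference` and both example functions are hypothetical, not from the talk): for any additive f(x1, x2) = g(x1) + h(x2), the mixed difference f(a, b) - f(a', b) - f(a, b') + f(a', b') is exactly zero, so a nonzero value indicates a non-additive interaction.

```python
def mixed_difference(f, a, b, a2, b2):
    """Second-order mixed difference of f over the rectangle (a, a2) x (b, b2).

    It vanishes for every additive function g(x1) + h(x2), so a nonzero
    value is evidence of a non-additive interaction between x1 and x2.
    """
    return f(a, b) - f(a2, b) - f(a, b2) + f(a2, b2)


def additive(x1, x2):
    # Sum of univariate functions: no interaction (illustrative example).
    return 3.0 * x1 + x2 ** 2


def multiplicative(x1, x2):
    # Cannot be written as g(x1) + h(x2): a non-additive interaction.
    return x1 * x2


additive_gap = mixed_difference(additive, 1.0, 2.0, 0.0, 0.0)
interaction_gap = mixed_difference(multiplicative, 1.0, 2.0, 0.0, 0.0)
```

Here `additive_gap` is zero while `interaction_gap` is not, which mirrors the definition above.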
Non-additivity
Challenges in data-driven interaction discovery
“Black-box” ML models
Overwhelming hypotheses
Many false positives
We aim to provide confidence estimation for data-driven non-additive interaction discovery
[Figure: ML models are trained on a samples × features matrix, used to predict, and then interpreted]
Goal: Discover, from ML models, a subset of non-additive interactions (e.g., gene-gene interactions) that are likely to be relevant, without too many false positives.
We use false discovery rate (FDR) as the confidence measure
Benjamini & Hochberg. Journal of the Royal statistical society: series B (1995)
$$\mathrm{FDR} = \mathbb{E}\left[\frac{\#\text{ of false positive interactions}}{\#\text{ of total accepted interactions}}\right]$$
How to estimate FDR in “black-box” ML models?
[Figure labels: total accepted interactions; score cutoff threshold]
Estimating false positives typically involves handling p-values
Features are quantified using interpretation methods
[Figure: ML models are trained on a samples × features matrix and then interpreted]
Popular interpretation methods:
[Figure labels: pairwise importance; marginal importance]
We distill non-additive interactive effects from the reported pairwise importance
[Figure: the reported pairwise importance is separated into a non-additive interactive effect, prediction-dependent marginal effects, and prediction-independent feature biases; other panel labels: marginal importance, presence of features]
FDR control is obtained by using knockoffs
Barber and Candes. The Annals of Statistics (2015)
[Figure: a knockoff generator takes the samples × features matrix and produces a samples × knockoff-features matrix]
The knockoffs are designed to replicate the intra- and inter-correlation structure of the input
The knockoffs preserve the correlation structure of the input
The knockoffs preserve correlations between themselves and input
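Both properties can be illustrated with the standard Gaussian knockoff construction. This is a sketch under stated assumptions, not necessarily the generator used in the talk: the features are assumed to follow N(0, Sigma) with a known correlation matrix Sigma, and the function name `gaussian_knockoffs` and the 0.99 shrinkage factor are illustrative choices. Under this construction, the knockoffs have covariance Sigma, and the covariance between a feature and another feature's knockoff equals the corresponding entry of Sigma.

```python
import numpy as np


def gaussian_knockoffs(X, Sigma, rng):
    """Sample Gaussian knockoffs for rows of X, assumed drawn from N(0, Sigma).

    Sigma is assumed to be a correlation matrix (unit diagonal).
    """
    p = Sigma.shape[0]
    # Equicorrelated choice s = min(1, 2 * lambda_min(Sigma)), shrunk slightly
    # (assumption) so the conditional covariance stays positive definite.
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    s = 0.99 * min(1.0, 2.0 * lam_min)
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    # Conditional law of the knockoffs given X:
    #   mean = X - X Sigma^{-1} D,  cov = 2D - D Sigma^{-1} D
    cond_mean = X - X @ Sigma_inv_D
    cond_cov = 2.0 * D - D @ Sigma_inv_D
    chol = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal(X.shape) @ chol.T


rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
X_knock = gaussian_knockoffs(X, Sigma, rng)
# 4 x 4 empirical covariance of [X, X_knock]; columns 0-1 are the original
# features and columns 2-3 are their knockoffs.
joint_cov = np.cov(np.hstack([X, X_knock]), rowvar=False)
```

In `joint_cov`, the knockoff block approximates Sigma, and the off-diagonal entries of the cross block approximate the original correlations; only the covariance between a feature and its own knockoff is reduced.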
The intuition behind knockoff design
Exchangeability
The input gene and its knockoff counterpart should be:
Equally likely to be correlated with noise
Equally likely to be correlated with signal
Equally likely to be a false positive
We train and interpret ML models using both the original features and their knockoffs
[Figure: ML models are trained on the original and knockoff features, used to predict, and then interpreted]
FDR is estimated based on knockoff-involving interactions
Intuition of FDR estimation:
Important original interactions have large scores
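Knockoff-involving and irrelevant original interactions have similar score distributions
Estimated FDR at a score threshold:
$$\widehat{\mathrm{FDR}} = \frac{\text{False discoveries}}{\text{Total discoveries}} \approx \frac{\#\mathrm{KO} - \#\mathrm{KK}}{\#\mathrm{OO}}$$
(OO: original-original, KO: knockoff-original, KK: knockoff-knockoff interactions scoring above the threshold)
Walzthoeni et al. Nature Methods (2012)
Stop at the last position in the ranked list where the ratio is below the target FDR level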
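The stopping rule can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: each scored interaction is assumed to carry a label "OO" (original-original), "KO" (knockoff-original), or "KK" (knockoff-knockoff), the function name `select_interactions` is hypothetical, and clamping the estimated false-discovery count at zero is an added choice.

```python
import numpy as np


def select_interactions(scores, labels, target_fdr=0.1):
    """Return indices of original-original interactions kept at target_fdr.

    scores: one importance score per interaction (larger = more important).
    labels: "OO", "KO", or "KK" for each interaction (assumed encoding).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")  # rank by decreasing score
    ranked = labels[order]
    n_oo = np.cumsum(ranked == "OO")
    n_ko = np.cumsum(ranked == "KO")
    n_kk = np.cumsum(ranked == "KK")
    # Estimated false discoveries (#KO - #KK), clamped at zero (assumption).
    est_false = np.maximum(n_ko - n_kk, 0)
    ratio = est_false / np.maximum(n_oo, 1)
    below = np.flatnonzero(ratio <= target_fdr)
    if below.size == 0:
        return np.array([], dtype=int)
    cutoff = below[-1]  # last rank at which the ratio is within the target
    top = order[: cutoff + 1]
    return top[labels[top] == "OO"]


scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
labels = ["OO", "OO", "OO", "KO", "OO", "KO"]
kept_strict = select_interactions(scores, labels, target_fdr=0.1)
kept_loose = select_interactions(scores, labels, target_fdr=0.25)
```

With the toy scores above, the stricter target stops before the first knockoff-involving interaction, while the looser target admits one more original interaction ranked below it.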
We evaluated our method on a test suite of simulation functions
[Figure: FDR and power for each simulation function]
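On a simulation function the true interactions are known, so both metrics can be computed directly. A minimal sketch, assuming interactions are represented as unordered feature pairs (the function name `fdp_and_power` is hypothetical): it returns the false discovery proportion of one run, whose average over repeated runs estimates the FDR, together with the power.

```python
def fdp_and_power(selected, true_interactions):
    """False discovery proportion and power for one simulation run.

    Both arguments are iterables of feature-index pairs; order within a
    pair is ignored, so (0, 1) and (1, 0) denote the same interaction.
    """
    selected = {frozenset(pair) for pair in selected}
    truth = {frozenset(pair) for pair in true_interactions}
    true_pos = len(selected & truth)
    # Guard against empty sets so the ratios are defined (0 by convention).
    fdp = (len(selected) - true_pos) / max(len(selected), 1)
    power = true_pos / max(len(truth), 1)
    return fdp, power


fdp, power = fdp_and_power(
    selected=[(0, 1), (3, 2), (4, 5)],
    true_interactions=[(1, 0), (2, 3), (6, 7), (8, 9)],
)
```

In this toy run, two of the three selected pairs are true interactions and two of the four true interactions are recovered.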
Distilled non-additive interactive effects are important for FDR control
[Figure: FDR using non-additive interaction effects versus reported interaction importance]
Distilled non-additive interactive effects exhibit similar distributions between original and knockoff interactions
[Figure: cumulative density of original and knockoff interactions, using non-additive interaction effects versus reported interaction importance]
We applied our method to a real Drosophila enhancer dataset to study enhancer activity
Data:
Basu et al. Proceedings of the National Academy of Sciences (2018)
Task:
Evaluation:
FDR threshold
Distilled non-additive interactive effects support the synergy between the proteins Krueppel and Twist
Conclusion: A general pipeline for data-driven scientific discovery
“Black-box” ML models
Overwhelming hypotheses
Many false positives
Interpretable AI
Synergy distillation
Knockoff-based FDR control
Impact: Pioneering effort demonstrating that interpretation of ML models could achieve statistical guarantees!
Questions?
Conclusion
Our method enables error-controlled interaction discovery in generic ML models without relying on p-values
Key idea:
Impact:
Research vision: Empower data-driven biology by developing a hypothesis generation engine
[Figure: a data-driven hypothesis generation engine connecting data, modelling, hypotheses, users, and scientific discovery]
Existing work to generate knockoffs
No method supports challenging settings such as generic ML models
We evaluated on a test suite of simulation functions
Tsang et al. International Conference on Learning Representations (2018)