ChromBPNet: Deep learning models of base-resolution chromatin profiles reveal cis-regulatory syntax and regulatory variation
AGCCAAGCAGCAAAGTTTTGCTGCTGTTTATTTTTGTAGCTCTTACTATATTCTACTTTTACCATTGAAAATATTGAGGAAGTTATTTATATTTCTATTTTTTATATATTATATATTTTATGTATTTTAATATTACTATTACACATAATTATTTTTTATATATATGAAGTACCAATGACTTCCTTTTCCAGAGCAATAATGAAATTTCACAGTATGAAAATGGAAGAAATCAATAAAATTATACGTGACCTGTGGCGAAGTACCTATCGTGGACAAGGTGAGTACCATGGTGTATCACAAATGCTCTTTCCAAAGCCCTCTCCGCAGCTCTTCCCCTTATGACCTCTCATCATGCCAGCATTACCTCCCTGGACCCCTTTCTAAGCATGTCTTTGAGATTTTCTAAGAATTCTTATCTTGGCAACATCTTGTAGCAAGAAAATGTAAAGTTTTCTGTTCCAGAGCCTAACAGGACTTACATATTTGACTGCAGTAGGCATTATATTTAGCTGATGACATAATAGGTTCTGTCATAGTGTAGATAGGGATAAGCCAAAATGCAATAAGAAAAACCATCCAGAGGAAACTCTTTTTTTTTTCTTTTTCTTTTTTTTTTTTCCAGATGGAGTCTCGCACTTCTCTGTCACCCGGGCTGGAGCGCAGTGGTGCAATCTTGGCTCACTGCAACCTCCACCTCCTGGGTTCAGGTGATTCTCCCACCTCAGCCTCCCGAGTAGTAGCTGGAATTACAGGTGCGCGCTCCCACACCTGGCTAATTTTTTGTATTCTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGCCCTCAGGTGATCTGCCCACCTTGGCCTCCCAGTGTTGGGTTTACAGGCGTGAGCCACCGCGCCTGGCCTGGAGGAAACTCTTAACAGGGAAACTAAGAAAGAGTTGAGGCTGAGGAACTGGGGCATCTGGGTTGCTTCTGGCCAGACCACCAGGCTCTTGAATCCTCCCAGCCAGAGAAAGAGTTTCCACACCAGCCATTGTTTTCCTCTGGTAATGTCAGCCTCATCTGTTGTTCCTAGGCTTACTTGATATGTTTGTAAATGACAAAAGGCTACAGAGCATAGA
Anna Shcherbina
Anusri Pampari
Anshul Kundaje
Regulatory DNA
Adapted from Shlyueva et al. (2014) Nature Reviews Genetics.
transcription factors
nucleosomes
sequence motifs
histone modifications
Sequence motifs
accessible chromatin
Profiling regulatory DNA
ATAC-seq/DNase-seq
Predictive sequence models of chromatin profiles
…GACAGATAATGCATTGA…
…GACTTGAAACGGCATTG…
No Peak (0) (0.3)
Peak (+1) (20.2)
Class = +1 (20.2)
Class = +1 (10.6)
Class = +1 (15.8)
Class = 0 (0.3)
Class = 0 (1.2)
Class = 0 (3.5)
Peak
No peak
…GACAGATAATGCATTGA…
…ACTGTCATGGATATTCT…
…GACTTGAAACGGCATTG…
…CAGTATGCATACGTGAA…
…CAACCTTGAACGGCATTG…
…GATATTCTACTGTAAG…
Arvey et al. 2012
Ghandi et al. 2014
Setty et al. 2015
Alipanahi et al. 2015
Zhou et al. 2015
Kelley et al. 2016, 2018
Avsec et al. 2021
DNase-seq / ATAC-seq data
Complex, multi-resolution ‘shapes’, ‘spans’ and footprints
DNase-seq / ATAC-seq
Consider an enzyme that only cuts at C
An extreme case
ACGAAACAATTGAGATACCAAAGTAAGTAT
True accessibility
Enzyme bias
Observed accessibility
We are interested in true accessibility, which unfortunately can be distorted by enzyme bias; the size of the distortion depends on how strong the bias is.
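The extreme case can be made concrete with a toy calculation. The sketch below assumes a simple multiplicative model (observed cuts = true accessibility x per-base enzyme preference) and a uniform true accessibility; both are illustrative assumptions, and the function name is hypothetical.

```python
# Toy model: observed cuts = true accessibility x enzyme sequence preference.
SEQ = "ACGAAACAATTGAGATACCAAAGTAAGTAT"  # sequence shown on the slide


def observed_accessibility(seq, true_access, bias_by_base):
    """Scale per-base true accessibility by the enzyme's per-base cut preference."""
    return [a * bias_by_base[b] for b, a in zip(seq, true_access)]


# Assumption: the whole window is equally accessible.
true_access = [1.0] * len(SEQ)
# Extreme case from the slide: the enzyme cuts only at C.
only_c = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}

observed = observed_accessibility(SEQ, true_access, only_c)
cut_positions = [i for i, v in enumerate(observed) if v > 0]
print(cut_positions)
```

Although the window is uniformly accessible in this toy setup, the observed signal is non-zero only where a C occurs, so the observed track reflects the enzyme rather than the chromatin.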
Enzyme preference bias
Tn5 and DNase-I enzyme sequence bias (PWM and k-mer models)
Tn5 cleavage logo
DNase-I cleavage logo
HINT-ATAC (Li et al. 2019)
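A k-mer bias model can be sketched as a ratio of frequencies: how often each k-mer is centred on a cut site, relative to how often it occurs at all. This is a minimal illustration of that idea, not the HINT-ATAC implementation; the function name, the choice of k = 3, and the toy cut counts are assumptions.

```python
from collections import Counter


def kmer_bias(seq, cut_counts, k=3):
    """Return cut-weighted k-mer frequency divided by background k-mer frequency.

    Assumes k is odd so each k-mer is centred on one position.
    """
    half = k // 2
    cut, background = Counter(), Counter()
    for i in range(half, len(seq) - half):
        kmer = seq[i - half:i + half + 1]
        background[kmer] += 1
        cut[kmer] += cut_counts[i]
    total_cut = sum(cut.values())
    total_bg = sum(background.values())
    return {
        kmer: (cut[kmer] / total_cut) / (background[kmer] / total_bg)
        for kmer in background
    }


seq = "ACGAAACAATTGAGATACCAAAGTAAGTAT"
# Toy cut counts (assumption): one cut at every C, none elsewhere.
cuts = [1 if base == "C" else 0 for base in seq]
bias = kmer_bias(seq, cuts)
print(sorted(bias.items(), key=lambda kv: -kv[1])[:4])
```

Values above 1 mark k-mers the enzyme cuts more often than expected from their abundance; values near 0 mark k-mers it avoids.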
ChromBPNet: Sequence to base-pair chromatin accessibility profiles
CGATAACCGATAT
1 Kb sequence
Based on Avsec et al. Nature Genetics 2021
NN enzyme bias predictor
total Tn5/DNase insertions (1 kb)
base-resolution probability profile (1 kb)
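The two outputs can be combined into expected per-base insertion counts. The sketch below assumes the profile output is available as unnormalised logits that are turned into probabilities with a softmax, and that the count output is a natural-log total; these conventions and all names are assumptions for illustration, not taken from the slides.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_per_base_counts(profile_logits, log_total_counts):
    """Distribute the predicted total count along the predicted profile shape."""
    probs = softmax(profile_logits)
    total = math.exp(log_total_counts)
    return [total * p for p in probs]


# Toy 5 bp "profile" instead of 1 kb, with a peak in the middle.
profile_logits = [0.0, 1.0, 3.0, 1.0, 0.0]
per_base = expected_per_base_counts(profile_logits, math.log(100.0))
print([round(v, 1) for v in per_base])
```

Separating shape (probability profile) from magnitude (total counts) lets each be evaluated with its own metric.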
How to estimate Tn5 / DNase bias?
Read distribution in background regions is a function of enzyme bias
Learns multiple Tn5 bias motifs
K562 DNase-seq prediction performance (held-out chromosomes)
Log(observed counts)
Spearman correlation = 0.7
Log(predicted total counts)
Total counts prediction performance
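Spearman correlation compares the ranks of observed and predicted values rather than the values themselves. A minimal standard-library sketch follows, using average ranks for ties; the example numbers are made up and the log(1 + x) transform of observed counts is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


observed_counts = [12, 150, 40, 900, 75]            # made-up region totals
predicted_log_counts = [2.1, 4.8, 3.9, 6.5, 4.0]    # made-up model outputs
log_observed = [math.log1p(c) for c in observed_counts]
rho = spearman(log_observed, predicted_log_counts)
print(rho)
```

Because only ranks matter, taking the log of the observed counts does not change the Spearman value; it matters only for the scatter plot axes.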
Jensen-Shannon Distance
Worst limit
Best limit
Observed vs. predicted profile
Profile prediction performance
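The Jensen-Shannon distance between the observed and predicted profiles can be computed as the square root of the Jensen-Shannon divergence. The sketch below uses base-2 logarithms so the distance lies between 0 (identical shapes) and 1 (non-overlapping shapes); the example profiles are made up.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(observed, predicted):
    """Square root of the Jensen-Shannon divergence between two profiles."""
    p, q = normalize(observed), normalize(predicted)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))


observed_profile = [0, 2, 10, 3, 0]                    # made-up per-base counts
predicted_profile = [0.05, 0.15, 0.60, 0.15, 0.05]     # made-up probabilities
distance = jensen_shannon_distance(observed_profile, predicted_profile)
print(round(distance, 3))
```

Lower is better for this metric, which is the opposite direction from the correlation used for total counts.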
ChromBPNet provides denoising and imputation of footprints at individual loci
Tn5
denoising!
Using existing work (HINT-ATAC and TOBIAS)
Observed track
ChromBPNet without bias correction
ChromBPNet bias model
ChromBPNet with HINT-ATAC correction
ChromBPNet with TOBIAS correction
ChromBPNet with bias correction
CTCF
Tn5
CTCF Footprint
Sequence contribution scores are obtained using the DeepLIFT algorithm (Shrikumar et al. 2017)
High concordance of footprints and contribution scores after correction in ATAC-seq and DNase-seq
ATAC-seq
Observed track
Predictions without bias correction
Bias-corrected predictions
Bias-corrected contribution scores
DNase-seq
Observed track
Predictions without bias correction
Bias-corrected predictions
Bias-corrected contribution scores
K562 Models
NRF1
NRF1
KLF12
Deciphering cell-type specific motifs and footprints
TF-MoDISCO: Cluster and consolidate predictive subsequences into contribution weight matrix (CWM) motifs
Insight: conv. filter contributions are integrated at the nucleotide level
Sequence 1
Sequence 2
Sequence 3
DeepLIFT
DeepLIFT
DeepLIFT
Shrikumar et al. 2018, arXiv
Avanti Shrikumar
Alex Tseng
GM12878
IRF
SP/KLF
RUNX
SPI1
ELK/ETS
NFKB
AP1
ATF
NFY
K562
GATA+TAL
SP/KLF
CTCF
ELK/ETS
AP1
NRF1
NFY
ATF
ETV
HepG2
KLF
FOXA
HNF4G
CTCF
GABPA
CEBPB
AP1
NFY
TCF7L2
H1-hESC
KLF
OCT-SOX
CTCF
ZIC3
SOX2
SP
TEAD
NRF1
NFY
Soumya Kundu
NFKB
GATA
HNF4A
ChromBPNet can predict marginal footprints of cell-type specific TFs
SP1
Uncorrected
Profile probability prediction
GM12878
ATAC-seq
K562
ATAC-seq
HepG2
ATAC-seq
200bp surrounding the motif insertion site
NFKB
GATA
HNF4A
ChromBPNet can predict marginal footprints of cell-type specific TFs
SP1
Uncorrected
Profile probability prediction
200bp surrounding the motif insertion site
Corrected
GM12878
ATAC-seq
K562
ATAC-seq
HepG2
ATAC-seq
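A marginal footprint can be sketched as follows: insert the motif at the centre of many background sequences, predict a profile for each, and average the 200 bp around the insertion site. Everything below is illustrative: `toy_predict_profile` is a hypothetical stand-in for a trained model, the motif instance and the random backgrounds are assumptions, and the real procedure may differ.

```python
import random


def insert_motif(background, motif):
    """Replace the centre of a background sequence with the motif."""
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]


def marginal_footprint(predict_profile, backgrounds, motif, flank=100):
    """Average predicted profiles over motif-inserted backgrounds, +/- flank bp."""
    total = None
    for bg in backgrounds:
        profile = predict_profile(insert_motif(bg, motif))
        centre = len(profile) // 2
        window = profile[centre - flank:centre + flank]
        total = window if total is None else [t + w for t, w in zip(total, window)]
    return [t / len(backgrounds) for t in total]


def toy_predict_profile(seq, motif="AGATAA"):
    """Hypothetical stand-in model: lower insertion probability over the motif."""
    weights = [1.0] * len(seq)
    start = seq.find(motif)
    while start != -1:
        for i in range(start, start + len(motif)):
            weights[i] = 0.2
        start = seq.find(motif, start + 1)
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
backgrounds = ["".join(rng.choice("ACGT") for _ in range(1000)) for _ in range(20)]
footprint = marginal_footprint(toy_predict_profile, backgrounds, "AGATAA")
print(len(footprint), footprint[100] < footprint[0])
```

Averaging over many backgrounds isolates the effect of the motif itself from the effect of any particular genomic context.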
GATA+TAL
ChromBPNet allows systematic comparison of DNase-seq and ATAC-seq footprints
Uncorrected
Profile probability prediction
200bp surrounding the motif insertion site
K562
ATAC-seq
K562
DNase-seq
Corrected
NFYB
GABPA
BACH
NRF1
HNF4A
(control)
K562
ATAC-seq
K562
DNase-seq
200bp surrounding the motif insertion site
Cooperative effects on footprints
AP1 + TEAD in fibroblasts
AP1
TEAD
AP1+TEAD
Surag Nair
Profile probability prediction
200bp surrounding the motif insertion site
Regulatory variant effect prediction
ChromBPNet can predict TF binding QTLs in LCLs for multiple chromatin readouts
SPI1 TF ChIP-seq
DNase-seq
H3K27ac ChIP-seq
1 Kb
5 Kb
1 Kb
rs5764238 (ref=C)
SPI1 bQTLs: Tehranchi et al. 2016
Example: SPI1 bQTL in LCLs
rs5764238 (alt=G)
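One common way to score a variant with a sequence model is to predict on the reference and alternate alleles and take the log fold change of the predicted total counts. The sketch below shows that pattern only; `toy_predict_log_counts` is a hypothetical stand-in for a trained model, and the sequence is made up and is not the rs5764238 locus.

```python
import math


def apply_allele(seq, pos, allele):
    """Return seq with the base at pos replaced by allele."""
    return seq[:pos] + allele + seq[pos + 1:]


def variant_effect(predict_log_counts, seq, pos, ref, alt):
    """log2 fold change (alt vs. ref) of predicted total counts.

    Assumes predict_log_counts returns natural-log total counts.
    """
    if seq[pos] != ref:
        raise ValueError("reference allele does not match the sequence")
    ref_log = predict_log_counts(apply_allele(seq, pos, ref))
    alt_log = predict_log_counts(apply_allele(seq, pos, alt))
    return (alt_log - ref_log) / math.log(2)


def toy_predict_log_counts(seq):
    """Hypothetical stand-in model: more GGAA cores means more predicted signal."""
    return math.log(1 + 10 * seq.count("GGAA"))


made_up_seq = "ACGTAGGAAGTCA"  # illustrative only
effect = variant_effect(toy_predict_log_counts, made_up_seq, pos=6, ref="G", alt="C")
print(round(effect, 3))
```

A negative score means the alternate allele is predicted to reduce signal; comparing full predicted profiles for the two alleles, rather than only totals, shows how far the effect extends around the variant.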
rs5764238 (ref=C)
SPI1 TF ChIP-seq
DNase-seq
H3K27ac ChIP-seq
1 Kb
5 Kb
1 Kb
ChromBPNet reveals the “blast radius” of regulatory variants on different chromatin readouts
“Blast radius” of a variant
1 Kb
5 Kb
1 Kb
SPI1 TF ChIP-seq
DNase-seq
H3K27ac ChIP-seq
DeepLIFT provides insights into motifs disrupted by the variant
ref=C
ref=C
ref=C
alt=G
alt=G
alt=G
200 bp
200 bp
200 bp
1 Kb
5 Kb
1 Kb
SPI1 motifs
SPI1 TF ChIP-seq
DNase-seq
H3K27ac ChIP-seq
ChromBPNet accurately predicts effect sizes and directions of LCL SPI1 bQTLs
SPI1 bQTLs: Tehranchi et al. 2016
Showing the 100 most significant and 100 non-significant variants (the latter chosen randomly from the 200 least significant) from the SPI1 bQTL set
GM12878 ATAC-seq model
ChromBPNet outperforms deltaSVM for predicting dsQTLs in LCLs
dsQTLs: Degner et al. 2012
deltaSVM: Lee et al. 2015
Predictions on 579 significant dsQTL SNPs and 30K non-dsQTL SNPs (about 50 times the size of the significant set), as used for deltaSVM
Summary
Acknowledgements
R01ES02500902
1U24HG009446
1U01HG009431
Funding
1DP2OD022870
1R01HG009674
Surag Nair
Aman Patel
Soumya Kundu
Anshul Kundaje
Jacob Schreiber
Austin Wang
Avanti Shrikumar
Additional Slides