Simon G. Coetzee
Hazelett Lab
CBM
What does it do?
MotifbreakR is a software tool in R/Bioconductor that scans genetic variants against position weight matrices of TFs to determine the potential for the disruption of TF binding at the site of the variant.
Leveraging Bioconductor
Bioconductor contains the sequence of 76 genome builds across 22 species.
This allows for importing variants from a wide variety of experiment types, from human disease to model organisms.
Bioconductor package MotifDB contains 12,657 PWMs from across 22 databases and 62 organisms.
It’s possible, though not easy, to create your own MotifDB object for PWMs that you’ve found yourself.
Where is it used?
We originally published motifbreakR for variant analysis of genome-wide association studies (GWAS).
Since then: Many studies of human disease, from neurodegenerative disorders to COVID-19 research.
And even further afield …
Model organisms:
Stranger Organisms:
Paleoanthropology:
Evolutionary Genomics:
…
How do we get our data into motifbreakR?
BED format
VCFs
lists of rsIDs
and even “custom” variants chr:pos:ref:alt
Originally only SNVs, but now also INDELs
Practically motifbreakR has involves three phases.
Built in Motif Databases
## MotifDb object of length 12657
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 12657 position frequency matrices from 22 sources:
## | FlyFactorSurvey: 614
## | HOCOMOCOv10: 1066
## | HOCOMOCOv11-core-A: 181
## | HOCOMOCOv11-core-B: 84
## | HOCOMOCOv11-core-C: 135
## | HOCOMOCOv11-secondary-A: 46
## | HOCOMOCOv11-secondary-B: 19
## | HOCOMOCOv11-secondary-C: 13
## | HOCOMOCOv11-secondary-D: 290
## | HOMER: 332
## | JASPAR_2014: 592
## | JASPAR_CORE: 459
## | ScerTF: 196
## | SwissRegulon: 684
## | UniPROBE: 380
## | cisbp_1.02: 874
## | hPDI: 437
## | jaspar2016: 1209
## | jaspar2018: 1564
## | jaspar2022: 1956
## | jolma2013: 843
## | stamlab: 683
## | 62 organism/s
## | Hsapiens: 6075
## | Mmusculus: 1554
## | Dmelanogaster: 1437
## | Athaliana: 1371
## | Scerevisiae: 1221
## | NA: 184
## | other: 815
GATA3
GATA3
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 |
C | 0.32 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.24 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 |
GATA3
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 |
C | 0.32 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.24 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 |
Sequence with a SNP
Ref:TAGTGAGATAAGTA
Alt:TAGTGAGAAAAGTA
SNP
Scoring
Sequence with a SNP:
GATA3 Motif Scans
Ref:TAGTGAGATAAGTACC
GATA3 Motif Scans:
Scoring a Frame
Ref:TAGTGAGATAAGTA
Pos:00001234567800
GATA3 Motif Scans:
Scoring a Frame
GAGATAAGTA0
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 |
C | 0.32 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.24 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 |
| 0.36 | 0.68 | 1.00 | 1.00 | 1.00 | 0.88 | 0.51 | 0.38 |
GAGATAAG
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 | 0.36 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 | 0.68 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 | 1.00 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 | 1.00 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 | 1.00 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 | 0.88 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 | 0.51 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 | 0.38 |
GATA3 Motif Scans:
Scoring a Frame�Alternate Allele
GAGAAAAGTA0
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 |
C | 0.32 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.24 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 |
| 0.36 | 0.68 | 1.00 | 1.00 | 0.00 | 0.88 | 0.51 | .38 |
Comparing Scores:
Sum of Log Probabilities
Alt Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
A | 0.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Ref Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
T | 1.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Comparing Scores:
Weighted Sum Method
Alt Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
A | 0.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Ref Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
T | 1.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Comparing Scores:
Information Content Method
Alt Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
A | 0.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Ref Allele | |
G | 0.36 |
A | 0.68 |
G | 1.00 |
A | 1.00 |
T | 1.00 |
A | 0.88 |
A | 0.51 |
G | 0.38 |
Base Probability
Information Content
GATA3 - calculate ωM
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.22 | 0.68 | 0.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.29 |
C | 0.32 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.24 |
G | 0.36 | 0.02 | 1.00 | 0.00 | 0.00 | 0.01 | 0.35 | 0.38 |
T | 0.10 | 0.22 | 0.00 | 0.00 | 1.00 | 0.11 | 0.12 | 0.09 |
Min | 0.10 | 0.02 | 0.00 | 0.00 | 0.00 | 0.01 | 0.02 | 0.09 |
Max | 0.36 | 0.68 | 1.00 | 1.00 | 1.00 | 0.88 | 0.51 | 0.38 |
| 0.26 | 0.66 | 1.00 | 1.00 | 1.00 | 0.87 | 0.49 | 0.29 |
GATA3 - calculate ωM
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 0.26 | 0.66 | 1.00 | 1.00 | 1.00 | 0.87 | 0.49 | 0.29 |
| × | × | × | × | × | × | × | × |
Ref | 0.36 | 0.68 | 1.00 | 1.00 | 1.00 | 0.88 | 0.51 | 0.38 |
Alt | 0.36 | 0.68 | 1.00 | 1.00 | 0.00 | 0.88 | 0.51 | 0.38 |
GATA3 - calculate ωM
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
A | 0.06 | 0.45 | 0.00 | 1.00 | 0.00 | 0.77 | 0.25 | 0.08 |
C | 0.08 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.07 |
G | 0.09 | 0.01 | 1.00 | 0.00 | 0.00 | 0.01 | 0.17 | 0.11 |
T | 0.02 | 0.16 | 0.00 | 0.00 | 1.00 | 0.10 | 0.06 | 0.03 |
MotifbreakR 2.0
Hey, I know, let’s add indels!
Also, wouldn’t it be nice to know if binding of the TF actually occurs at that site?
And let’s make sharing of the tool and output easier to do.
What kind of features
ATCTCAGGGAGT - reference CAG
ATCTCAGATGGAGT alt CAGAT
ATCTTAGGGAGT TAG
ATCTCGGAGT C
ATCTCAGGGAGT reference CAGGG
ATCTCGGAGT alt CGG
So Handling INDELs are hard
For variants that have single base REF and ALT aka SNV
Normal motifbreakR
If there either REF or ALT is a single base, easy to find location of INDEL
What kind of features - simple
ATCTCAGGGAGT - reference CAG
ATCTCAGATGGAGT alt CAGAT
ATCTTAGGGAGT TAG
ATCTCGGAGT C
ATCTCAGGGAGT reference CAGGG
ATCTCGGAGT alt CGG
So Handling INDELs are hard
For variants that have single base REF and ALT aka SNV
Normal motifBreakR
If there either REF or ALT is a single base, easy to find location of INDEL
If both REF and ALT are more than one base:
Pairwise alignment
What kind of features - complex
ATCTCAGGGAGT - reference CAG
ATCTCAGATGGAGT alt CAGAT
ATCTTAGGGAGT TAG
ATCTCGGAGT C
ATCTCAGGGAGT reference CAGGG
ATCTCGGAGT alt CGG
Determining Location
SNV: bp location of variant
INDEL of equal length: just an SNV incorrectly labeled (mostly)
Deletion - Reference positions available for all positions, still using relative positions
12345�ATCTCAGGGAGT reference CAG�ATCTC GGAGT alt C alt location c(2,3)
ATCTCAGGGAGT reference CAGGG�ATCTC GGAGT alt CGG alt location c(2,3)
Determining Location
SNV: bp location of variant
INDEL of equal length: just an SNV incorrectly labeled
Insertion: No reference position for two inserted bases. Using relative position.
12345�ATCTCAG--GGAGT - reference CAG
ATCTCAGATGGAGT alt CAGAT alt location c(4,5)
The Kinds of Motif Matches
Calculate P-value
Usage of PWMs needs as a prerequisite to knowing the statistical significance of a sequence according to its score.
We use:
Proves the problem is NP-hard.
Then, describes a novel algorithm that solves the P-value problem efficiently.
Setting Granularity
Sometimes it’s still slow, so we allow the setting of PWM granularity.
This trades the accuracy of the p-value calculation for speed of convergence. For most purposes a range of 1e-4 to 1e-6 is an acceptable trade off between accuracy and speed.
Let 𝑀 be a matrix of real coefficient values of length 𝑚 and let 𝜖 be a positive real number. We denote 𝑀𝜖 the round matrix deduced from 𝑀 by rounding each value by 𝜖:
Find Matching Peaks
While TF binding is based partly on sequence preference, predictions of TF binding based on sequence preference alone can indicate many more potential binding events than observed.
We believe that grounding the analysis in observed transcription factor binding can improve the prioritization of results.
Remap2022
R/Shiny
rs143969848
Thank you
Hazelett Lab:
Dr. Dennis Hazelett
Peter Nguyen
Irina Silacheva
CBM
Dr. Jason Moore
MotifbreakR v2: extended capability and database integration
now available on arXiv:
https://www.arxiv.org/abs/2407.03441