1 of 51

Simon G. Coetzee

Hazelett Lab

CBM

2 of 51

What does it do?

MotifbreakR is a software tool in R/Bioconductor that scans genetic variants against position weight matrices of TFs to determine the potential for the disruption of TF binding at the site of the variant.

3 of 51

Leveraging Bioconductor

Bioconductor contains the sequence of 76 genome builds across 22 species.

This allows for importing variants from a wide variety of experiment types, from human disease to model organisms.

Bioconductor package MotifDB contains 12,657 PWMs from across 22 databases and 62 organisms.

It’s possible, though not easy, to create your own MotifDB object for PWMs that you’ve found yourself.

4 of 51

Where is it used?

We originally published motifbreakR for variant analysis of genome-wide association studies (GWAS).

Since then: Many studies of human disease, from neurodegenerative disorders to COVID-19 research.

And even further afield …

5 of 51

Model organisms:

6 of 51

Stranger Organisms:

7 of 51

Paleoanthropology:

8 of 51

Evolutionary Genomics:

9 of 51

How do we get our data into motifbreakR?

BED format

VCFs

lists of rsIDs

and even “custom” variants chr:pos:ref:alt

Originally only SNVs, but now also INDELs

10 of 51

Practically motifbreakR has involves three phases.

  1. Read in Single Nucleotide Variants or Indels: The first step is to read in the list of variants. Variants can be a list of rsIDs if your SNPs are represented in one of the SNPlocs packages, may be included as a .bed file with a specifically formatted name field, or may be provided as a .vcf file. We then transform these input such that it may be read by motifbreakR.
  2. Find Broken Motifs: Next we present motifbreakR with the input generated in the previous step, along with a set of motifs formatted as class MotifList, and your preferred scoring method.
  3. Visualize SNPs and motifs: Finally we can visualize which motifs are broken by any individual SNP using our plotting function.

11 of 51

12 of 51

Built in Motif Databases

## MotifDb object of length 12657

## | Created from downloaded public sources, last update: 2022-Mar-04

## | 12657 position frequency matrices from 22 sources:

## | FlyFactorSurvey: 614

## | HOCOMOCOv10: 1066

## | HOCOMOCOv11-core-A: 181

## | HOCOMOCOv11-core-B: 84

## | HOCOMOCOv11-core-C: 135

## | HOCOMOCOv11-secondary-A: 46

## | HOCOMOCOv11-secondary-B: 19

## | HOCOMOCOv11-secondary-C: 13

## | HOCOMOCOv11-secondary-D: 290

## | HOMER: 332

## | JASPAR_2014: 592

## | JASPAR_CORE: 459

## | ScerTF: 196

## | SwissRegulon: 684

## | UniPROBE: 380

## | cisbp_1.02: 874

## | hPDI: 437

## | jaspar2016: 1209

## | jaspar2018: 1564

## | jaspar2022: 1956

## | jolma2013: 843

## | stamlab: 683

## | 62 organism/s

## | Hsapiens: 6075

## | Mmusculus: 1554

## | Dmelanogaster: 1437

## | Athaliana: 1371

## | Scerevisiae: 1221

## | NA: 184

## | other: 815

13 of 51

14 of 51

15 of 51

GATA3

16 of 51

GATA3

 

1

2

3

4

5

6

7

8

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

C

0.32

0.08

0.00

0.00

0.00

0.00

0.02

0.24

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

17 of 51

GATA3

 

1

2

3

4

5

6

7

8

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

C

0.32

0.08

0.00

0.00

0.00

0.00

0.02

0.24

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

18 of 51

Sequence with a SNP

Ref:TAGTGAGATAAGTA

Alt:TAGTGAGAAAAGTA

SNP

19 of 51

Scoring

  1. Define searchable window
    1. at least 1 base of motif must overlap variant
  2. Find Best Match in reference allele anywhere in the variant.
  3. Find Best Match in alternate allele anywhere in the variant.
  4. Calculate score difference.

20 of 51

Sequence with a SNP:

GATA3 Motif Scans

Ref:TAGTGAGATAAGTACC

21 of 51

GATA3 Motif Scans:

Scoring a Frame

Ref:TAGTGAGATAAGTA

Pos:00001234567800

22 of 51

GATA3 Motif Scans:

Scoring a Frame

GAGATAAGTA0

 

1

2

3

4

5

6

7

8

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

C

0.32

0.08

0.00

0.00

0.00

0.00

0.02

0.24

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

0.36

0.68

1.00

1.00

1.00

0.88

0.51

0.38

23 of 51

GAGATAAG

1

2

3

4

5

6

7

8

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

0.36

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

0.68

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

1.00

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

1.00

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

1.00

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

0.88

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

0.51

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

0.38

24 of 51

GATA3 Motif Scans:

Scoring a Frame�Alternate Allele

GAGAAAAGTA0

 

1

2

3

4

5

6

7

8

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

C

0.32

0.08

0.00

0.00

0.00

0.00

0.02

0.24

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

0.36

0.68

1.00

1.00

0.00

0.88

0.51

.38

25 of 51

Comparing Scores:

Sum of Log Probabilities

Alt Allele

G

0.36

A

0.68

G

1.00

A

1.00

A

0.00

A

0.88

A

0.51

G

0.38

Ref Allele

G

0.36

A

0.68

G

1.00

A

1.00

T

1.00

A

0.88

A

0.51

G

0.38

 

 

 

26 of 51

Comparing Scores:

Weighted Sum Method

Alt Allele

G

0.36

A

0.68

G

1.00

A

1.00

A

0.00

A

0.88

A

0.51

G

0.38

Ref Allele

G

0.36

A

0.68

G

1.00

A

1.00

T

1.00

A

0.88

A

0.51

G

0.38

 

 

 

27 of 51

Comparing Scores:

Information Content Method

Alt Allele

G

0.36

A

0.68

G

1.00

A

1.00

A

0.00

A

0.88

A

0.51

G

0.38

Ref Allele

G

0.36

A

0.68

G

1.00

A

1.00

T

1.00

A

0.88

A

0.51

G

0.38

 

 

 

28 of 51

Base Probability

Information Content

29 of 51

GATA3 - calculate ωM 

 

1

2

3

4

5

6

7

8

A

0.22

0.68

0.00

1.00

0.00

0.88

0.51

0.29

C

0.32

0.08

0.00

0.00

0.00

0.00

0.02

0.24

G

0.36

0.02

1.00

0.00

0.00

0.01

0.35

0.38

T

0.10

0.22

0.00

0.00

1.00

0.11

0.12

0.09

Min

0.10

0.02

0.00

0.00

0.00

0.01

0.02

0.09

Max

0.36

0.68

1.00

1.00

1.00

0.88

0.51

0.38

0.26

0.66

1.00

1.00

1.00

0.87

0.49

0.29

30 of 51

GATA3 - calculate ωM 

 

1

2

3

4

5

6

7

8

0.26

0.66

1.00

1.00

1.00

0.87

0.49

0.29

×

×

×

×

×

×

×

×

Ref

0.36

0.68

1.00

1.00

1.00

0.88

0.51

0.38

Alt

0.36

0.68

1.00

1.00

0.00

0.88

0.51

0.38

31 of 51

GATA3 - calculate ωM 

 

1

2

3

4

5

6

7

8

A

0.06

0.45

0.00

1.00

0.00

0.77

0.25

0.08

C

0.08

0.05

0.00

0.00

0.00

0.00

0.01

0.07

G

0.09

0.01

1.00

0.00

0.00

0.01

0.17

0.11

T

0.02

0.16

0.00

0.00

1.00

0.10

0.06

0.03

32 of 51

33 of 51

MotifbreakR 2.0

Hey, I know, let’s add indels!

Also, wouldn’t it be nice to know if binding of the TF actually occurs at that site?

And let’s make sharing of the tool and output easier to do.

34 of 51

What kind of features

ATCTCAGGGAGT - reference CAG

ATCTCAGATGGAGT alt CAGAT

ATCTTAGGGAGT TAG

ATCTCGGAGT C

ATCTCAGGGAGT reference CAGGG

ATCTCGGAGT alt CGG

35 of 51

So Handling INDELs are hard

For variants that have single base REF and ALT aka SNV

Normal motifbreakR

If there either REF or ALT is a single base, easy to find location of INDEL

36 of 51

What kind of features - simple

ATCTCAGGGAGT - reference CAG

ATCTCAGATGGAGT alt CAGAT

ATCTTAGGGAGT TAG

ATCTCGGAGT C

ATCTCAGGGAGT reference CAGGG

ATCTCGGAGT alt CGG

37 of 51

So Handling INDELs are hard

For variants that have single base REF and ALT aka SNV

Normal motifBreakR

If there either REF or ALT is a single base, easy to find location of INDEL

If both REF and ALT are more than one base:

Pairwise alignment

38 of 51

What kind of features - complex

ATCTCAGGGAGT - reference CAG

ATCTCAGATGGAGT alt CAGAT

ATCTTAGGGAGT TAG

ATCTCGGAGT C

ATCTCAGGGAGT reference CAGGG

ATCTCGGAGT alt CGG

39 of 51

Determining Location

SNV: bp location of variant

INDEL of equal length: just an SNV incorrectly labeled (mostly)

Deletion - Reference positions available for all positions, still using relative positions

12345�ATCTCAGGGAGT reference CAG�ATCTC GGAGT alt C alt location c(2,3)

ATCTCAGGGAGT reference CAGGG�ATCTC GGAGT alt CGG alt location c(2,3)

40 of 51

Determining Location

SNV: bp location of variant

INDEL of equal length: just an SNV incorrectly labeled

Insertion: No reference position for two inserted bases. Using relative position.

12345�ATCTCAG--GGAGT - reference CAG

ATCTCAGATGGAGT alt CAGAT alt location c(4,5)

41 of 51

The Kinds of Motif Matches

42 of 51

43 of 51

44 of 51

45 of 51

Calculate P-value

Usage of PWMs needs as a prerequisite to knowing the statistical significance of a sequence according to its score.

We use:

Proves the problem is NP-hard.

Then, describes a novel algorithm that solves the P-value problem efficiently.

46 of 51

Setting Granularity

Sometimes it’s still slow, so we allow the setting of PWM granularity.

This trades the accuracy of the p-value calculation for speed of convergence. For most purposes a range of 1e-4 to 1e-6 is an acceptable trade off between accuracy and speed.

Let 𝑀 be a matrix of real coefficient values of length 𝑚 and let 𝜖 be a positive real number. We denote 𝑀𝜖 the round matrix deduced from 𝑀 by rounding each value by 𝜖:

47 of 51

Find Matching Peaks

While TF binding is based partly on sequence preference, predictions of TF binding based on sequence preference alone can indicate many more potential binding events than observed.

We believe that grounding the analysis in observed transcription factor binding can improve the prioritization of results.

48 of 51

Remap2022

49 of 51

50 of 51

R/Shiny

rs143969848

51 of 51

Thank you

Hazelett Lab:

Dr. Dennis Hazelett

Peter Nguyen

Irina Silacheva

CBM

Dr. Jason Moore

MotifbreakR v2: extended capability and database integration

now available on arXiv:

https://www.arxiv.org/abs/2407.03441