1 of 26

Development of methods for predicting and annotating super enhancers by DNA sequence using neural networks

Kravchuk Ekaterina,

1st year student of the Master's program at the Faculty of Biology, Moscow State University,

Department of Bioengineering

Course: "Neural Networks and Their Application in Scientific Research"

2 of 26

Super enhancers

(Kravchuk et al., 2023)

3 of 26

Super enhancers

Typical enhancers and super enhancers (Kravchuk et al., 2023)

4 of 26

Pott et al., 2015

5 of 26

Existing models: ROSE

(Whyte et al., 2013; Lovén et al., 2013)

1) Stitching enhancer elements into a single SE if the genomic distance between them is less than 12.5 kb
2) Ranking the resulting regions by the signal level of the chosen SE mark
3) Splitting the ranked regions into two groups (SE and TE) at a specific cut-off point
4) Optional filtering by the distance between the SE and the transcription start site: regions closer than 2.5 kb are disregarded
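The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ROSE implementation: the `Region` class, the summing of signal over stitched elements, and the user-supplied `cutoff` are assumptions made for the example, since the slide does not say how the signal is aggregated or how the cut-off point is chosen.

```python
from dataclasses import dataclass

STITCH_DISTANCE = 12_500  # bp, step 1
TSS_EXCLUSION = 2_500     # bp, optional step 4


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    signal: float  # level of the chosen SE mark (hypothetical units)


def stitch(enhancers, max_gap=STITCH_DISTANCE):
    """Step 1: merge same-chromosome enhancers separated by less than max_gap."""
    stitched = []
    for e in sorted(enhancers, key=lambda r: (r.chrom, r.start)):
        last = stitched[-1] if stitched else None
        if last and last.chrom == e.chrom and e.start - last.end < max_gap:
            last.end = max(last.end, e.end)
            last.signal += e.signal  # assumption: signal is summed when stitching
        else:
            stitched.append(Region(e.chrom, e.start, e.end, e.signal))
    return stitched


def rank_and_split(regions, cutoff):
    """Steps 2-3: rank by signal, then split into SE and TE at the cutoff."""
    ranked = sorted(regions, key=lambda r: r.signal, reverse=True)
    supers = [r for r in ranked if r.signal >= cutoff]
    typical = [r for r in ranked if r.signal < cutoff]
    return supers, typical


def far_from_tss(region, tss_positions, min_dist=TSS_EXCLUSION):
    """Step 4 (optional): keep a region only if no TSS lies within min_dist of it."""
    def distance(tss):
        if region.start <= tss <= region.end:
            return 0
        return min(abs(region.start - tss), abs(region.end - tss))
    return all(distance(t) >= min_dist for t in tss_positions)


enhancers = [
    Region("chr1", 1_000, 2_000, 5.0),
    Region("chr1", 10_000, 11_000, 7.0),   # 8 kb from the first: stitched
    Region("chr1", 50_000, 51_000, 1.0),   # 39 kb away: stays separate
]
stitched = stitch(enhancers)
supers, typical = rank_and_split(stitched, cutoff=10.0)
print(len(stitched), len(supers), len(typical))  # 2 1 1
```

In this toy input the first two enhancers are merged into one region with a combined signal of 12.0, which passes the hypothetical cutoff of 10.0, while the isolated third enhancer is classed as typical.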

6 of 26

Existing models: imPROSE

(Khan et al., 2019)

7 of 26

Existing models: DEEPSEN

(Bu et al., 2019)

8 of 26

Existing models: DeepSE

(Ji et al., 2021)

9 of 26

(Ji et al., 2021)

10 of 26

DeepSE

(Ji et al., 2021)

11 of 26

Models for solving similar problems

  • For typical enhancer (TE) annotation: a hybrid model based on CNN and RNN that predicts enhancer regions with histone modification marks as input (Lim et al., 2019)
  • BERT-based CNN architecture with only DNA sequence as input for TE prediction (Le et al., 2021)
  • DNABERT (Ji et al., 2021)
  • BigBird (Zaheer et al., 2020)

12 of 26

Data

| Genome | Dataset | TE | SE | SE:TE |
|--------|---------|----|----|-------|
| Mouse | mESC (constituent) | 9970 | 645 | 1:15 |
| Mouse | mESC | 8562 | 231 | 1:37 |
| Mouse | Myotube | 4769 | 535 | 1:9 |
| Mouse | Macrophage | 9514 | 961 | 1:10 |
| Mouse | Pro-B cell | 13419 | 396 | 1:34 |
| Mouse | Th-cell | 18129 | 436 | 1:41 |
| Human | MM1.S | 11685 | 640 | 1:18 |
| Human | H2171 | 16354 | 357 | 1:46 |
| Human | U87 | 19231 | 1073 | 1:18 |
| Mouse | imPROSE (mESC) | 9374 | 646 | 1:15 |

13 of 26

Models: DNABERT

(Ji et al., 2021)

14 of 26

Models: DNABERT

(Ji et al., 2021)

15 of 26

DNABERT fine-tuning results

| Dataset | Model | Accuracy | AUC | F1 | MCC | Precision | Recall |
|---------|-------|----------|-----|----|-----|-----------|--------|
| imPROSE (mESC) | BertForLongSequenceClassification | 0.928 | 0.479 | 0.481 | 0.0 | 0.464 | 0.5 |
| imPROSE (mESC) | BertForLongSequenceClassificationCat | 0.928 | 0.536 | 0.481 | 0.0 | 0.464 | 0.5 |
| myotube | BertForLongSequenceClassification | 0.63 | 0.57 | 0.386 | 0.0 | 0.315 | 0.5 |
| myotube | BertForLongSequenceClassificationCat | 0.457 | 0.461 | 0.438 | -0.11 | 0.433 | 0.439 |

16 of 26

Models: BigBird

(Zaheer et al., 2020)

17 of 26

Models: BigBird

Hyperparameters used:

  • num_train_epochs = 5
  • per_device_train_batch_size = 2
  • gradient_accumulation_steps = 32
  • per_device_eval_batch_size = 16
  • weight_decay = 0.01
  • learning_rate = 1e-4 (for classifier fine-tuning) or 1e-5 (for full model fine-tuning)

Loss function: cross-entropy loss
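The parameter names match fields of Hugging Face `TrainingArguments`, which suggests, though the slide does not confirm, that the `Trainer` API was used. The sketch below keeps the values in a plain dictionary so that it runs without extra dependencies, derives the effective batch size they imply, and computes cross-entropy for a single example. The `num_devices` argument and both helper functions are illustrative assumptions.

```python
import math

# Values copied from the slide; the dictionary itself is only an illustration.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 32,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
}
LEARNING_RATES = {"classifier_only": 1e-4, "full_model": 1e-5}


def effective_batch_size(cfg, num_devices=1):
    """Examples contributing to one optimizer step (num_devices is an assumption)."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_devices
    )


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits, using log-sum-exp for stability."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_partition - logits[target]


print(effective_batch_size(training_config))    # 64
print(round(cross_entropy([0.0, 0.0], 1), 4))   # 0.6931
```

A per-device batch of 2 with 32 accumulation steps updates the weights once per 64 sequences on a single device. The second line shows the loss of an uninformative two-class prediction, ln 2, which is a useful reference point when reading training curves.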

18 of 26

Models: BigBird

Differences between GENA-LM (BigBird-base T2T) and DNABERT:

  • BPE tokenization instead of k-mers;
  • input sequence size is about 24000 nucleotides (4096 BPE tokens) compared to 510 nucleotides of DNABERT;
  • pre-training on T2T vs. GRCh38.p13 human genome assembly.
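The tokenization difference explains most of the gap in input length. A minimal sketch, assuming DNABERT's overlapping k-mer scheme with k = 6 (one of its published variants); the function name is hypothetical:

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA string into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ATGCGTAC", k=6)
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']

# Overlapping k-mers advance by one nucleotide per token, so the number of
# tokens grows almost one-to-one with sequence length.
# BPE tokens do not overlap; the figures on the slide imply this average:
nucleotides_per_bpe_token = 24000 / 4096
print(round(nucleotides_per_bpe_token, 2))  # 5.86
```

Because each BPE token covers several nucleotides without overlap, a window of 4096 tokens spans tens of kilobases, which is the scale needed to fit a whole super enhancer into one input.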

19 of 26

Data

| Genome | Dataset | TE | SE | SE:TE |
|--------|---------|----|----|-------|
| Mouse | mESC (constituent) | 9970 | 645 | 1:15 |
| Mouse | mESC | 8562 | 231 | 1:37 |
| Mouse | Myotube | 4769 | 535 | 1:9 |
| Mouse | Macrophage | 9514 | 961 | 1:10 |
| Mouse | Pro-B cell | 13419 | 396 | 1:34 |
| Mouse | Th-cell | 18129 | 436 | 1:41 |
| Human | MM1.S | 11685 | 640 | 1:18 |
| Human | H2171 | 16354 | 357 | 1:46 |
| Human | U87 | 19231 | 1073 | 1:18 |
| Mouse | imPROSE (mESC) | 9374 | 646 | 1:15 |

20 of 26

Distribution of the length of sequence in tokens

21 of 26

GENA-LM fine-tuning

22 of 26

GENA-LM fine-tuning

23 of 26

GENA-LM fine-tuning results

| Fine-tuning mode | Balanced accuracy | AUC | Precision | Recall | F1 | MCC |
|------------------|-------------------|-----|-----------|--------|----|-----|
| Classifier fine-tuning | 0.5729 | 0.5729 | 0.9247 | 0.8141 | 0.8622 | 0.0797 |
| Full model fine-tuning | 0.6010 | 0.6010 | 0.9281 | 0.7789 | 0.8409 | 0.1012 |
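In both rows balanced accuracy and AUC agree to four decimal places. One possible explanation, which the slides do not confirm, is that AUC was computed from hard class predictions rather than from predicted probabilities; in that case the two quantities are mathematically identical. The sketch below checks this identity on toy labels (all data here are made up for illustration).

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(p == 1 for p in positives) / len(positives)
    tnr = sum(p == 0 for p in negatives) / len(negatives)
    return (tpr + tnr) / 2


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: 4 positives, 6 negatives, hard 0/1 predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

ba = balanced_accuracy(y_true, y_pred)
auc_from_labels = roc_auc(y_true, y_pred)
print(round(ba, 4), round(auc_from_labels, 4))  # 0.7083 0.7083
```

If this is what happened, recomputing AUC from the class probabilities would give a threshold-independent number and would make the AUC column comparable with the DNABERT results.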

24 of 26

Captum results

25 of 26

Captum results visualization in UCSC browser
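The slide does not say how the attribution scores were exported for the browser. One common route is a bedGraph custom track, sketched below under that assumption. The function name, the example coordinates, and the scores are all hypothetical; the only fixed parts are the bedGraph conventions (a `track type=bedGraph` header line, then chromosome, 0-based start, end, and value per line).

```python
def attributions_to_bedgraph(chrom, region_start, token_lengths, scores,
                             name="attributions"):
    """Turn per-token attribution scores into bedGraph text.

    token_lengths holds the number of nucleotides covered by each token,
    so consecutive tokens map to consecutive, non-overlapping intervals.
    """
    lines = [f'track type=bedGraph name="{name}"']
    position = region_start
    for length, score in zip(token_lengths, scores):
        lines.append(f"{chrom}\t{position}\t{position + length}\t{score:.4f}")
        position += length
    return "\n".join(lines)


# Hypothetical region with two tokens of 6 and 4 nucleotides.
track = attributions_to_bedgraph("chr1", 1000, [6, 4], [0.25, -0.1])
print(track)
```

Variable-length intervals matter here because BPE tokens cover different numbers of nucleotides, so a fixed step per score would misplace the signal.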

26 of 26

References

  • Kravchuk EV, Ashniev GA, Gladkova MG, Orlov AV, Vasileva AV, Boldyreva AV, Burenin AG, Skirda AM, Nikitin PI, Orlova NN. Experimental Validation and Prediction of Super-Enhancers: Advances and Challenges. Cells. 2023 Apr 19;12(8):1191. doi: 10.3390/cells12081191. PMID: 37190100; PMCID: PMC10136858.
  • Ji QY, Gong XJ, Li HM, Du PF. DeepSE: Detecting super-enhancers among typical enhancers using only sequence feature embeddings. Genomics. 2021 Nov;113(6):4052-4060. doi: 10.1016/j.ygeno.2021.10.007. Epub 2021 Oct 16. PMID: 34666191.
  • Pott S, Lieb JD. What are super-enhancers? Nat Genet. 2015 Jan;47(1):8-12. doi: 10.1038/ng.3167. PMID: 25547603.
  • Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, Rahl PB, Lee TI, Young RA. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013 Apr 11;153(2):307-19. doi: 10.1016/j.cell.2013.03.035. PMID: 23582322; PMCID: PMC3653129.
  • Lovén J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, Bradner JE, Lee TI, Young RA. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell. 2013 Apr 11;153(2):320-34. doi: 10.1016/j.cell.2013.03.036. PMID: 23582323; PMCID: PMC3760967.
  • Khan A, Zhang X. Integrative modeling reveals key chromatin and sequence signatures predicting super-enhancers. Sci Rep. 2019;9:2877. doi: 10.1038/s41598-019-38979-9.
  • Bu H, Hao J, Gan Y, Zhou S, Guan J. DEEPSEN: a convolutional neural network based method for super-enhancer prediction. BMC Bioinformatics. 2019 Dec 24;20(Suppl 15):598. doi: 10.1186/s12859-019-3180-z. PMID: 31874597; PMCID: PMC6929276.
  • Lim A, Lim S, Kim S. Enhancer prediction with histone modification marks using a hybrid neural network model. Methods. 2019 Aug 15;166:48-56. doi: 10.1016/j.ymeth.2019.03.014. Epub 2019 Mar 21. PMID: 30905748.
  • Le NQK, Ho QT, Nguyen TT, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021 Sep 2;22(5):bbab005. doi: 10.1093/bib/bbab005. PMID: 33539511.
  • Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Feb 4:btab083. doi: 10.1093/bioinformatics/btab083. Epub ahead of print. PMID: 33538820.
  • Zaheer M, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems. 2020;33:17283-17297.