1 of 26

Development of methods for predicting and annotating super enhancers by DNA sequence using neural networks

Kravchuk Ekaterina,
first-year Master's student, Department of Bioengineering, Faculty of Biology, Moscow State University

Course: "Neural Networks and Their Application in Scientific Research"

2 of 26

Super enhancers

(Kravchuk et al., 2023)

3 of 26

Super enhancers

Typical enhancers and super enhancers (Kravchuk et al., 2023)

4 of 26

(Pott and Lieb, 2015)

5 of 26

Existing models: ROSE

(Whyte et al., 2013; Lovén et al., 2013)

1) Stitching enhancer elements into a single SE candidate if the genomic distance between them is less than 12.5 kb

2) Ranking the resulting regions by the signal level of the chosen SE mark

3) Splitting the ranked regions into two groups at a specific cut-off point

4) Optional filtering by the distance between the SE and the transcription start site: regions closer than 2.5 kb are disregarded
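
The first three steps above can be sketched in a few lines of Python. This is an illustration only, not the ROSE implementation: the `Region` class, the function names, the summing of constituent signals, and the numeric `cutoff` argument are all assumptions made for the example (ROSE itself works on ChIP-seq read density and derives its own cut-off). The optional TSS filter is omitted.

```python
from dataclasses import dataclass

STITCH_DISTANCE = 12_500  # bp, the 12.5 kb threshold from the slide


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    signal: float  # hypothetical signal value of the chosen SE mark


def stitch(enhancers, max_gap=STITCH_DISTANCE):
    """Merge enhancers on the same chromosome that lie closer than max_gap."""
    stitched = []
    for e in sorted(enhancers, key=lambda r: (r.chrom, r.start)):
        last = stitched[-1] if stitched else None
        if last and last.chrom == e.chrom and e.start - last.end < max_gap:
            last.end = max(last.end, e.end)
            last.signal += e.signal  # assumption: signals of constituents add up
        else:
            stitched.append(Region(e.chrom, e.start, e.end, e.signal))
    return stitched


def rank_and_split(regions, cutoff):
    """Rank regions by signal (descending) and split them at a given cut-off."""
    ranked = sorted(regions, key=lambda r: r.signal, reverse=True)
    supers = [r for r in ranked if r.signal >= cutoff]
    typical = [r for r in ranked if r.signal < cutoff]
    return supers, typical


# Toy input, made up for illustration
enhancers = [
    Region("chr1", 1_000, 2_000, 5.0),
    Region("chr1", 9_000, 10_000, 7.0),
    Region("chr1", 50_000, 51_000, 3.0),
    Region("chr2", 1_000, 2_000, 4.0),
]
stitched = stitch(enhancers)
supers, typical = rank_and_split(stitched, cutoff=10.0)
```

In this toy input the first two chr1 enhancers are 7 kb apart, so they are stitched into one region; the third is 40 kb away and stays separate.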

6 of 26

Existing models: imPROSE

(Khan et al., 2019)

7 of 26

Existing models: DEEPSEN

(Bu et al., 2019)

8 of 26

Existing models: DeepSE

(Ji et al., 2021)

9 of 26

(Ji et al., 2021)

10 of 26

DeepSE

(Ji et al., 2021)

11 of 26

Models for solving similar problems

For typical enhancer (TE) annotation: a hybrid CNN and RNN model that predicts enhancer regions from histone modification marks (Lim et al., 2019)
A BERT-based CNN architecture that predicts TEs from DNA sequence alone (Le et al., 2021)
DNABERT (Ji et al., 2021)
BigBird (Zaheer et al., 2020)

12 of 26

Data

Genome	Dataset	TE	SE	SE:TE
Mouse	mESC (constituent)	9970	645	1:15
Mouse	mESC	8562	231	1:37
Mouse	Myotube	4769	535	1:9
Mouse	Macrophage	9514	961	1:10
Mouse	Pro-B cell	13419	396	1:34
Mouse	Th-cell	18129	436	1:41
Human	MM1.S	11685	640	1:18
Human	H2171	16354	357	1:46
Human	U87	19231	1073	1:18
Mouse	imPROSE (mESC)	9374	646	1:15

13 of 26

Models: DNABERT

(Ji et al., 2021)

14 of 26

Models: DNABERT

(Ji et al., 2021)

15 of 26

DNABERT fine-tuning results

Dataset	Model	Accuracy	AUC	F1	MCC	Precision	Recall
imPROSE (mESC)	BertForLongSequenceClassification	0.928	0.479	0.481	0.0	0.464	0.5
imPROSE (mESC)	BertForLongSequenceClassificationCat	0.928	0.536	0.481	0.0	0.464	0.5
Myotube	BertForLongSequenceClassification	0.63	0.57	0.386	0.0	0.315	0.5
Myotube	BertForLongSequenceClassificationCat	0.457	0.461	0.438	-0.11	0.433	0.439
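
Three of the four rows above combine MCC = 0.0 with recall = 0.5 and precision equal to half the accuracy. These are the values that macro-averaged metrics take when a classifier assigns every sequence to a single class. The sketch below reproduces the imPROSE (mESC) row under two assumptions that the slide does not state: that the metrics are macro-averaged, and that the test split holds 928 TE and 72 SE sequences (a hypothetical split chosen to match the reported accuracy).

```python
import math


def macro_metrics(tp, fp, fn, tn):
    """Macro-averaged binary metrics from confusion-matrix counts."""
    def safe_div(a, b):
        # Undefined ratios (0/0) are counted as 0.0
        return a / b if b else 0.0

    # Per-class precision and recall: positive (SE) and negative (TE)
    p_pos, r_pos = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    p_neg, r_neg = safe_div(tn, tn + fn), safe_div(tn, tn + fp)
    f_pos = safe_div(2 * p_pos * r_pos, p_pos + r_pos)
    f_neg = safe_div(2 * p_neg * r_neg, p_neg + r_neg)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": (p_pos + p_neg) / 2,
        "recall": (r_pos + r_neg) / 2,
        "f1": (f_pos + f_neg) / 2,
        "mcc": safe_div(tp * tn - fp * fn, mcc_denom),
    }


# Hypothetical split: 928 TE, 72 SE, every sequence predicted as TE
baseline = macro_metrics(tp=0, fp=0, fn=72, tn=928)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

If the assumptions hold, the high accuracy in these rows reflects class imbalance rather than any ability to separate SEs from TEs.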

16 of 26

Models: BigBird

(Zaheer et al., 2020)

17 of 26

Models: BigBird

Hyperparameters	num_train_epochs = 5, per_device_train_batch_size = 2, gradient_accumulation_steps = 32, per_device_eval_batch_size = 16, weight_decay = 0.01, learning_rate = 1e-4 (classifier fine-tuning) or 1e-5 (full-model fine-tuning)
Loss function	Cross-entropy loss
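
As a rough illustration of what these settings imply, the sketch below stores them in a plain dictionary, derives the effective training batch size, and evaluates the cross-entropy loss for a single two-class example. The key names are copied from the slide (they match the keyword arguments of Hugging Face `TrainingArguments`, although the slide does not say which training API was used); the helper functions and the single-device default are assumptions made for the example.

```python
import math

# Values copied from the slide
hyperparams = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 32,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
}
learning_rates = {"classifier": 1e-4, "full_model": 1e-5}


def effective_batch_size(params, num_devices=1):
    """Number of sequences contributing to one optimizer step.

    num_devices=1 is an assumption; the slide does not give the device count.
    """
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


def cross_entropy(logits, target):
    """Cross-entropy loss for one example, given raw class logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


batch = effective_batch_size(hyperparams)        # 2 * 32 * 1
uninformed_loss = cross_entropy([0.0, 0.0], 1)   # equal logits: ln(2)
```

A small per-device batch with many accumulation steps is a common way to fit long input sequences into GPU memory while keeping the optimizer step size reasonable.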

18 of 26

Models: BigBird

Differences between GENA-LM (BigBird-base T2T) and DNABERT:

BPE tokenization instead of k-mers;
an input size of about 24,000 nucleotides (4096 BPE tokens), compared with 510 nucleotides for DNABERT;
pre-training on the T2T human genome assembly rather than GRCh38.p13.
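
To make the tokenization difference concrete, the sketch below splits a sequence into overlapping k-mers and computes the average number of nucleotides covered by one BPE token from the figures above. The function name and the choice of k = 6 are assumptions made for the example; the slide does not state which k was used.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers with a stride of 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokenize("ATGCATGC", k=6)
print(tokens)  # ['ATGCAT', 'TGCATG', 'GCATGC']

# Overlapping k-mers advance one nucleotide per token, so the token count
# grows almost one-to-one with sequence length.
# BPE tokens are variable-length; using the figures from the slide:
nucleotides_per_bpe_token = 24_000 / 4096
print(round(nucleotides_per_bpe_token, 2))  # 5.86
```

Because each BPE token covers several nucleotides on average without overlapping its neighbours, the same token budget reaches much further along the genome.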

19 of 26

Data

Genome	Dataset	TE	SE	SE:TE
Mouse	mESC (constituent)	9970	645	1:15
Mouse	mESC	8562	231	1:37
Mouse	Myotube	4769	535	1:9
Mouse	Macrophage	9514	961	1:10
Mouse	Pro-B cell	13419	396	1:34
Mouse	Th-cell	18129	436	1:41
Human	MM1.S	11685	640	1:18
Human	H2171	16354	357	1:46
Human	U87	19231	1073	1:18
Mouse	imPROSE (mESC)	9374	646	1:15

20 of 26

Distribution of sequence lengths in tokens

21 of 26

GENA-LM fine-tuning

22 of 26

GENA-LM fine-tuning

23 of 26

GENA-LM fine-tuning results

Fine-tuning mode	Balanced accuracy	AUC	Precision	Recall	F1	MCC
Classifier fine-tuning	0.5729	0.5729	0.9247	0.8141	0.8622	0.0797
Full model fine-tuning	0.6010	0.6010	0.9281	0.7789	0.8409	0.1012
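
In both rows the AUC equals the balanced accuracy to four decimal places. One situation that produces this is computing ROC AUC from hard 0/1 predictions instead of predicted probabilities, because the two quantities are then mathematically identical. Whether that is what happened here is an assumption; the sketch below only demonstrates the identity on made-up labels.

```python
def split_by_class(y_true, values):
    """Return the values belonging to positive and to negative examples."""
    pos = [v for t, v in zip(y_true, values) if t == 1]
    neg = [v for t, v in zip(y_true, values) if t == 0]
    return pos, neg


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    pos, neg = split_by_class(y_true, y_pred)
    tpr = sum(pos) / len(pos)
    tnr = 1 - sum(neg) / len(neg)
    return (tpr + tnr) / 2


def pairwise_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos, neg = split_by_class(y_true, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and hard predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

bal_acc = balanced_accuracy(y_true, y_pred)
auc_from_labels = pairwise_auc(y_true, y_pred)
```

Passing class probabilities to the AUC computation instead of hard labels would give a threshold-independent value that can differ from balanced accuracy.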

24 of 26

Captum results

25 of 26

Captum results visualization in UCSC browser

26 of 26

References

Kravchuk EV, Ashniev GA, Gladkova MG, Orlov AV, Vasileva AV, Boldyreva AV, Burenin AG, Skirda AM, Nikitin PI, Orlova NN. Experimental Validation and Prediction of Super-Enhancers: Advances and Challenges. Cells. 2023 Apr 19;12(8):1191. doi: 10.3390/cells12081191. PMID: 37190100; PMCID: PMC10136858.
Ji QY, Gong XJ, Li HM, Du PF. DeepSE: Detecting super-enhancers among typical enhancers using only sequence feature embeddings. Genomics. 2021 Nov;113(6):4052-4060. doi: 10.1016/j.ygeno.2021.10.007. Epub 2021 Oct 16. PMID: 34666191.
Pott S, Lieb JD. What are super-enhancers? Nat Genet. 2015 Jan;47(1):8-12. doi: 10.1038/ng.3167. PMID: 25547603.
Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, Rahl PB, Lee TI, Young RA. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013 Apr 11;153(2):307-19. doi: 10.1016/j.cell.2013.03.035. PMID: 23582322; PMCID: PMC3653129.
Lovén J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, Bradner JE, Lee TI, Young RA. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell. 2013 Apr 11;153(2):320-34. doi: 10.1016/j.cell.2013.03.036. PMID: 23582323; PMCID: PMC3760967.
Khan A, Zhang X. Integrative modeling reveals key chromatin and sequence signatures predicting super-enhancers. Sci Rep. 2019;9:2877. doi: 10.1038/s41598-019-38979-9.
Bu H, Hao J, Gan Y, Zhou S, Guan J. DEEPSEN: a convolutional neural network based method for super-enhancer prediction. BMC Bioinformatics. 2019 Dec 24;20(Suppl 15):598. doi: 10.1186/s12859-019-3180-z. PMID: 31874597; PMCID: PMC6929276.
Lim A, Lim S, Kim S. Enhancer prediction with histone modification marks using a hybrid neural network model. Methods. 2019 Aug 15;166:48-56. doi: 10.1016/j.ymeth.2019.03.014. Epub 2019 Mar 21. PMID: 30905748.
Le NQK, Ho QT, Nguyen TT, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021 Sep 2;22(5):bbab005. doi: 10.1093/bib/bbab005. PMID: 33539511.
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Feb 4:btab083. doi: 10.1093/bioinformatics/btab083. Epub ahead of print. PMID: 33538820.
Zaheer M, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems. 2020;33:17283-17297.
