Development of methods for predicting and annotating super enhancers by DNA
sequence using neural networks
Kravchuk Ekaterina,
1st year student of the Master's program at the Faculty of Biology, Moscow State University,
Department of Bioengineering
Course "Neural Networks and Their Application in Scientific Research"
Super enhancers
(Kravchuk et al., 2023)
Typical enhancers and super enhancers (Kravchuk et al., 2023)
(Pott et al., 2015)
Existing models: ROSE
(Whyte et al., 2013; Lovén et al., 2013)
1) Stitching of enhancer elements into one SE if the genomic distance between them is less than 12.5 kb
2) Ranking of the obtained regions by the signal level of the chosen SE mark
3) Sorting into two groups using a specific cut-off point
4) Filtering based on the distance between the SE and the transcription start site: regions closer than 2.5 kb are disregarded (optional)
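The first three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ROSE implementation: the `Region` tuple layout, the function names, summing the signal of stitched elements, and passing the cut-off in as a fixed number are all assumptions made for the example. The optional TSS filter is omitted.

```python
from typing import List, Tuple

# Hypothetical record layout: (chromosome, start, end, signal of the SE mark)
Region = Tuple[str, int, int, float]

def stitch(enhancers: List[Region], max_gap: int = 12_500) -> List[Region]:
    """Step 1: merge neighbouring enhancers separated by less than max_gap bp."""
    stitched: List[Region] = []
    for chrom, start, end, signal in sorted(enhancers):
        if stitched:
            p_chrom, p_start, p_end, p_signal = stitched[-1]
            if chrom == p_chrom and start - p_end < max_gap:
                # Assumption: the signal of a stitched region is the sum of its parts.
                stitched[-1] = (chrom, p_start, max(end, p_end), p_signal + signal)
                continue
        stitched.append((chrom, start, end, signal))
    return stitched

def rank_and_split(regions: List[Region], cutoff: float):
    """Steps 2-3: rank by signal, then split at a cut-off (here given by the caller)."""
    ranked = sorted(regions, key=lambda r: r[3], reverse=True)
    supers = [r for r in ranked if r[3] >= cutoff]
    typical = [r for r in ranked if r[3] < cutoff]
    return supers, typical

enhancers = [
    ("chr1", 1_000, 2_000, 5.0),
    ("chr1", 9_000, 10_000, 7.0),    # 7 kb from the previous element: stitched
    ("chr1", 40_000, 41_000, 3.0),   # 30 kb away: kept separate
    ("chr2", 500, 1_500, 1.0),
]
stitched = stitch(enhancers)
supers, typical = rank_and_split(stitched, cutoff=10.0)
print(supers)
print(typical)
```

With the toy input, the first two chr1 elements merge into one region that passes the illustrative cut-off, while the other two regions remain typical enhancers.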
Existing models: imPROSE
(Khan et al., 2019)
Existing models: DEEPSEN
(Bu et al., 2019)
Existing models: DeepSE
(Ji et al., 2021)
DeepSE
(Ji et al., 2021)
Models for solving similar problems
Data
Genome | Dataset | Typical enhancers (TE) | Super enhancers (SE) | SE:TE |
Mouse | mESC (constituent) | 9970 | 645 | 1:15 |
Mouse | mESC | 8562 | 231 | 1:37 |
Mouse | Myotube | 4769 | 535 | 1:9 |
Mouse | Macrophage | 9514 | 961 | 1:10 |
Mouse | Pro-B cell | 13419 | 396 | 1:34 |
Mouse | Th-cell | 18129 | 436 | 1:41 |
Human | MM1.S | 11685 | 640 | 1:18 |
Human | H2171 | 16354 | 357 | 1:46 |
Human | U87 | 19231 | 1073 | 1:18 |
Mouse | imPROSE (mESC) | 9374 | 646 | 1:15 |
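The SE:TE column follows directly from the two count columns. The short sketch below recomputes it for a few rows of the table and also reports the share of super enhancers among all regions; the dictionary and function names are illustrative only.

```python
# (TE count, SE count) copied from the table above for three datasets.
ENHANCER_COUNTS = {
    "mESC": (8562, 231),
    "Myotube": (4769, 535),
    "H2171": (16354, 357),
}

def imbalance(te: int, se: int) -> dict:
    """Return the number of TEs per SE and the fraction of SEs in the dataset."""
    return {"te_per_se": te / se, "se_share": se / (te + se)}

for name, (te, se) in ENHANCER_COUNTS.items():
    stats = imbalance(te, se)
    print(f"{name}: SE:TE = 1:{stats['te_per_se']:.1f}, SE share = {stats['se_share']:.1%}")
```

In every dataset super enhancers make up only a few percent of the regions, so any classifier trained on these data faces a strongly imbalanced problem.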
Models: DNABERT
(Ji et al., 2021)
Models: DNABERT
(Ji et al., 2021)
DNABERT fine-tuning results
Dataset | Model | Accuracy | AUC | F1 | MCC | Precision | Recall |
imPROSE (mESC) | BertForLongSequenceClassification | 0.928 | 0.479 | 0.481 | 0.0 | 0.464 | 0.5 |
imPROSE (mESC) | BertForLongSequenceClassificationCat | 0.928 | 0.536 | 0.481 | 0.0 | 0.464 | 0.5 |
Myotube | BertForLongSequenceClassification | 0.63 | 0.57 | 0.386 | 0.0 | 0.315 | 0.5 |
Myotube | BertForLongSequenceClassificationCat | 0.457 | 0.461 | 0.438 | -0.11 | 0.433 | 0.439 |
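The imPROSE (mESC) rows have the pattern expected from a model that assigns every sequence to the majority class. The sketch below checks this under two assumptions that the slides do not state: that precision, recall and F1 are macro-averaged over the two classes, and that 92.8% of the evaluation set are typical enhancers. The function name and the toy label vectors are illustrative.

```python
from math import sqrt

def macro_metrics(y_true, y_pred):
    """Accuracy, macro-averaged precision/recall/F1 and MCC for binary labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in (0, 1):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0   # undefined ratio treated as 0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / 2,
        "recall": sum(recalls) / 2,
        "f1": sum(f1s) / 2,
        "mcc": mcc,
    }

# Toy evaluation set: 928 typical enhancers (0), 72 super enhancers (1),
# and a model that always predicts "typical enhancer".
y_true = [0] * 928 + [1] * 72
y_pred = [0] * 1000
metrics = macro_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in metrics.items()})
```

Under these assumptions the constant predictor gives accuracy 0.928, precision 0.464, recall 0.5, F1 0.481 and MCC 0.0, the same values as in the table, which suggests that the high accuracy there reflects class imbalance rather than learned signal.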
Models: BigBird
(Zaheer et al., 2020)
Models: BigBird
Hyperparameters | num_train_epochs = 5, per_device_train_batch_size = 2, gradient_accumulation_steps = 32, per_device_eval_batch_size = 16, weight_decay = 0.01, learning_rate = 1e-4 (classifier fine-tuning) or 1e-5 (full model fine-tuning) |
Loss function | Cross-entropy loss |
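The settings above can be written down as a plain configuration. The key names are the ones listed on the slide and match keyword arguments of Hugging Face `TrainingArguments`; the dictionary layout, the helper functions and the single-device assumption are illustrative, not taken from the original training script. The sketch also shows the effective batch size implied by gradient accumulation and a reference implementation of cross-entropy for one example.

```python
import math

# Values from the slide; the dict itself is only an illustrative container.
TRAINING_ARGS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 32,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
}
LEARNING_RATES = {"classifier": 1e-4, "full_model": 1e-5}

def effective_batch_size(args: dict, n_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step (n_devices is an assumption)."""
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * n_devices)

def cross_entropy(logits, target: int) -> float:
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

print(effective_batch_size(TRAINING_ARGS))
print(round(cross_entropy([0.0, 0.0], target=1), 4))
```

On one device, the small per-device batch of 2 combined with 32 accumulation steps gives 64 sequences per optimizer step.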
Models: BigBird
Differences between GENA-LM (BigBird-base T2T) and DNABERT:
Data
Genome | Dataset | Typical enhancers (TE) | Super enhancers (SE) | SE:TE |
Mouse | mESC (constituent) | 9970 | 645 | 1:15 |
Mouse | mESC | 8562 | 231 | 1:37 |
Mouse | Myotube | 4769 | 535 | 1:9 |
Mouse | Macrophage | 9514 | 961 | 1:10 |
Mouse | Pro-B cell | 13419 | 396 | 1:34 |
Mouse | Th-cell | 18129 | 436 | 1:41 |
Human | MM1.S | 11685 | 640 | 1:18 |
Human | H2171 | 16354 | 357 | 1:46 |
Human | U87 | 19231 | 1073 | 1:18 |
Mouse | imPROSE (mESC) | 9374 | 646 | 1:15 |
Distribution of sequence length in tokens
GENA-LM fine-tuning
GENA-LM fine-tuning
GENA-LM fine-tuning results
Fine-tuning mode | Balanced accuracy | AUC | Precision | Recall | F1 | MCC |
Classifier fine-tuning | 0.5729 | 0.5729 | 0.9247 | 0.8141 | 0.8622 | 0.0797 |
Full model fine-tuning | 0.6010 | 0.6010 | 0.9281 | 0.7789 | 0.8409 | 0.1012 |
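In both rows the AUC equals the balanced accuracy. One possible explanation, which is an assumption and not something the slides state, is that AUC was computed from hard class labels rather than from predicted probabilities: in that case ROC AUC reduces to the mean of sensitivity and specificity. The sketch below demonstrates this identity on toy labels with hand-written helper functions.

```python
def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def roc_auc(y_true, scores) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and hard (0/1) predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

ba = balanced_accuracy(y_true, y_pred)
auc_from_labels = roc_auc(y_true, y_pred)
print(ba, auc_from_labels)
```

If this reading is correct, computing AUC from the predicted class probabilities would give a separate, more informative value.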
Captum results
Captum results visualization in UCSC browser
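One common way to display per-position scores in the UCSC browser is a bedGraph custom track. The sketch below converts per-token attribution scores into bedGraph lines; it assumes that each token's span within the sequence (0-based, half-open) and the genomic start of the sequence are known. The function name, arguments and example values are hypothetical, and the slides do not say which track format was actually used.

```python
from typing import List, Tuple

def to_bedgraph(chrom: str,
                region_start: int,
                token_spans: List[Tuple[int, int]],
                scores: List[float],
                track_name: str = "attributions") -> str:
    """Build a bedGraph custom track from per-token attribution scores.

    token_spans holds (start, end) offsets of each token inside the sequence;
    region_start is the 0-based genomic coordinate of the first base.
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    for (start, end), score in zip(token_spans, scores):
        lines.append(
            f"{chrom}\t{region_start + start}\t{region_start + end}\t{score:.4f}"
        )
    return "\n".join(lines)

# Hypothetical example: two tokens covering 6 bp and 9 bp of a region on chr1.
track = to_bedgraph("chr1", 1_000, [(0, 6), (6, 15)], [0.12, -0.05])
print(track)
```

The resulting text can be pasted into the browser's custom track form or saved to a file and uploaded.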
References
Zaheer M. et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020): 17283-17297.