1 of 16

Seminar 13

Genomic Foundation Models

2 of 16

Submitted to arXiv: 6 Dec 2024

Also in: NeurIPS Datasets and Benchmarks 2024

3 of 16

Zero-shot: embeddings

4 of 16

Zero-shot: likelihood

Sum the log-likelihoods of all tokens in the sequence to get a (quasi-)likelihood of the entire sequence, i.e., of the regulatory element
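A minimal sketch of the summation described above. The per-token probabilities are made-up illustration values, and the function name is hypothetical. For an autoregressive model the sum is the sequence log-likelihood; for a masked model, where each token is scored while masked, it is only a pseudo-log-likelihood, hence "(quasi)".

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum per-token log-probabilities into one sequence-level score.

    token_probs: probability the model assigned to each observed token
    (hypothetical input; a real model would produce these).
    """
    return sum(math.log(p) for p in token_probs)

# Made-up probabilities for a three-token sequence
probs = [0.5, 0.25, 0.5]
score = sequence_log_likelihood(probs)
print(score)
```

Working in log space avoids numerical underflow: multiplying hundreds of per-token probabilities directly would quickly round to zero.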

5 of 16

Supervised: probing or fine-tuning

6 of 16

Models evaluated

7 of 16

Task 1

Task objective

Test if the model can distinguish between regulatory elements and compositionally matched negative sequences

Method

Data: 2.3 million regulatory elements, length 350bp

Zero-shot: compare paired likelihoods

Supervised: probing with final embeddings and fine-tuning
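The zero-shot comparison of paired likelihoods can be sketched as follows, assuming a pair counts as correct when the regulatory element receives a higher log-likelihood than its matched negative. The function name and all numbers are hypothetical.

```python
def paired_accuracy(positive_lls, negative_lls):
    """Fraction of pairs where the real element outscores its matched negative."""
    if len(positive_lls) != len(negative_lls):
        raise ValueError("each positive needs exactly one matched negative")
    correct = sum(p > n for p, n in zip(positive_lls, negative_lls))
    return correct / len(positive_lls)

# Made-up sequence log-likelihoods for four (element, negative) pairs
pos = [-410.2, -395.0, -402.7, -420.1]
neg = [-415.8, -398.3, -401.9, -431.0]
acc = paired_accuracy(pos, neg)
print(acc)  # 0.75: the third pair is ranked the wrong way round
```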

Results

Zero-shot: as effective as the supervised setting

Supervised: both fine-tuning and probing have slightly higher accuracy than the ab initio model

8 of 16

Task 3

Task objective

Evaluate whether representations learned by DNALMs encode cell-type-specific regulatory sequence features.

Method

Data: cell-type specific regulatory elements based on ATAC-seq chromatin accessibility experiments.

Zero-shot: clustered model embeddings of the cell-type specific regulatory sequences using k-means and quantified label separation using the adjusted Mutual Information Score across labels.

Supervised: overall accuracy and binary classification metrics (accuracy, AUROC, AUPRC) for each cell type versus the rest.
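As an illustration of the per-cell-type binary metrics, here is a minimal one-vs-rest AUROC sketch. It uses the rank interpretation of AUROC (the probability that a random positive outscores a random negative, ties counting half). Cell-type names "A", "B", "C" and all scores are placeholders, not data from the paper.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auroc(true_types, scores_by_type):
    """Score each cell type against all others pooled together."""
    result = {}
    for cell_type, scores in scores_by_type.items():
        labels = [1 if t == cell_type else 0 for t in true_types]
        result[cell_type] = auroc(labels, scores)
    return result

# Placeholder labels and per-class scores for four sequences
true_types = ["A", "B", "A", "C"]
scores_by_type = {
    "A": [0.9, 0.2, 0.6, 0.4],
    "B": [0.1, 0.7, 0.3, 0.8],
    "C": [0.0, 0.1, 0.1, 0.5],
}
per_type = one_vs_rest_auroc(true_types, scores_by_type)
print(per_type)
```

The pairwise loop is quadratic in the number of sequences, which is fine for a toy example; a real evaluation would use a sorted-rank implementation.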

Results

Zero-shot: DNALMs failed to separate sequences by cell type; a simple motif-counting baseline worked better than all DNALMs. UMAP showed no clear separation of cell types in DNALM embeddings.

Supervised:

Probed DNALMs: best accuracy 38% (GENA-LM); best per-cell-type AUROC 0.6 to 0.8

Fine-tuned DNALMs: best accuracy 67% (Caduceus); best per-cell-type AUROC 0.87 to 0.94

Ab initio (ChromBPNet-like): accuracy 66.7%; per-cell-type AUROC 0.84 to 0.90

9 of 16

Task 4

Task objective:

Evaluate if DNALMs can predict quantitative chromatin accessibility from DNA sequence

Method

Data: 2 kb genomic sequences labeled with DNase-seq signal from five ENCODE cell types

Zero-shot: N/A

Supervised: DNALMs fine-tuned end-to-end on the sequence-to-activity (S2A) regression task

Results

Zero-shot: not evaluated

Supervised: fine-tuned DNALMs showed strong performance, nearly matching the ab initio CNN baseline on regression and classification; probing underperformed.

10 of 16

Task 5

Task objective

Test how well the model can predict the effects of genetic variation on chromatin accessibility

Method

Data: 2 QTL studies that associate genetic variation with variation of chromatin accessibility from ATAC-seq or DNase-seq experiments across a large cohort of lymphoblastoid cell lines (LCLs) from individuals of African ancestry

Zero-shot: cosine distance, log-likelihood

Supervised: absolute difference in predicted accessibility between the two alleles
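The three variant scores named above can be sketched as follows. The sign convention of the log-likelihood score (alternate minus reference) is an assumption, as are the function names; embeddings, likelihoods and predictions are made-up values.

```python
import math

def cosine_distance(u, v):
    """Zero-shot score: 1 minus cosine similarity of the two allele embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def log_likelihood_difference(ll_ref, ll_alt):
    """Zero-shot score: change in sequence log-likelihood (assumed alt - ref)."""
    return ll_alt - ll_ref

def predicted_effect(pred_ref, pred_alt):
    """Supervised score: absolute difference in predicted accessibility."""
    return abs(pred_alt - pred_ref)

# Made-up inputs for a single variant
dist = cosine_distance([1.0, 0.0], [0.0, 1.0])    # orthogonal embeddings
ll_diff = log_likelihood_difference(-400.0, -403.5)
effect = predicted_effect(2.5, 1.0)
print(dist, ll_diff, effect)
```

In each case the model sees two sequences that differ only at the variant position, one carrying the reference allele and one the alternate allele.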

Results

Zero-shot: NT achieved the best performance on both the Yoruba and African datasets

Supervised: fine-tuned sequence-to-activity models were better than their probed counterparts.

Insight: fine-tuned models underperformed the ab initio baseline ChromBPNet on variant effect prediction, which confirms the importance of including counterfactual tasks alongside observational assessments in evaluations.

11 of 16

Variant scoring

12 of 16

Variant scoring

multi-species

13 of 16

Variant scoring

The largest context length

14 of 16

Conclusions

1. Via unsupervised training, DNALMs learn some biologically relevant representations.

2. Simpler ab initio supervised models match or exceed the performance of much larger, fine-tuned DNALMs.

3. DNALMs perform particularly poorly on the variant effect prediction task.

15 of 16

Next time

Final seminar: summary & sweets

No paper questions, but answers to a couple of course summary questions

16 of 16

See you next time!