Seminar 13
Genomic Foundation Models
Submitted to arXiv: 6 Dec 2024
Also in: NeurIPS Datasets and Benchmarks 2024
Zero-shot: embeddings
Zero-shot: likelihood
Sum the log-likelihoods of all tokens in the sequence to get a (quasi-)log-likelihood of the entire sequence, i.e., of the regulatory element
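A minimal sketch of this scoring step, assuming the model has already produced the probability it assigns to each observed token. The function name and the example probabilities are hypothetical. For masked models, the per-token probabilities would typically be obtained by masking each position in turn, which is presumably why the slide calls the result a quasi-likelihood.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum per-token log probabilities into one sequence-level score."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical probabilities assigned by a model to the observed tokens
example_probs = [0.5, 0.25, 0.5]
score = sequence_log_likelihood(example_probs)
print(round(score, 4))
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow on sequences that are hundreds of tokens long.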
Supervised: probing or fine-tuning
Models evaluated
Task 1
Task objective
Test if the model can distinguish between regulatory elements and compositionally matched negative sequences
Method
Data: 2.3 million regulatory elements, length 350bp
Zero-shot: compare paired likelihoods
Supervised: probing with final embeddings and fine-tuning
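The zero-shot paired comparison can be sketched as follows, assuming each regulatory element comes with one matched negative and that both have already been scored with a sequence log-likelihood. The function name and the scores are made up for illustration and are not taken from the benchmark code.

```python
def paired_accuracy(pos_scores, neg_scores):
    """Fraction of pairs where the real element scores above its matched negative."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("each element needs exactly one matched negative")
    if not pos_scores:
        raise ValueError("no pairs to compare")
    wins = sum(1 for p, n in zip(pos_scores, neg_scores) if p > n)
    return wins / len(pos_scores)

# Hypothetical log-likelihoods for four element/negative pairs
element_ll = [-410.2, -398.7, -405.0, -420.1]
negative_ll = [-415.9, -401.3, -402.4, -431.0]
acc = paired_accuracy(element_ll, negative_ll)
print(acc)
```

Because the negatives are compositionally matched, an accuracy well above 0.5 suggests the model responds to more than base composition.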
Results
Zero-shot: as effective as the supervised setting
Supervised: both fine-tuning and probing have slightly higher accuracy than the ab initio model
Task 3
Task objective
Evaluate whether representations learned by DNALMs encode cell-type-specific regulatory sequence features.
Method
Data: cell-type-specific regulatory elements based on ATAC-seq chromatin accessibility experiments.
Zero-shot: cluster model embeddings of the cell-type-specific regulatory sequences with k-means and quantify label separation using the Adjusted Mutual Information score between cluster assignments and cell-type labels.
Supervised: overall accuracy, plus binary classification metrics (accuracy, AUROC, AUPRC) for each cell type versus the others.
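The per-cell-type binary metrics can be illustrated with a one-versus-rest AUROC. This is a sketch under the assumption that the classifier outputs one score per cell type for every sequence. It uses the pairwise definition of AUROC (the probability that a random positive outranks a random negative), which is simple but quadratic in the number of sequences. Cell-type names and scores are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

def one_vs_rest_auroc(class_scores, true_labels):
    """AUROC for each cell type, treating all other cell types as negatives."""
    return {
        cell_type: auroc(scores, [1 if t == cell_type else 0 for t in true_labels])
        for cell_type, scores in class_scores.items()
    }

# Hypothetical scores for four sequences and two cell types
true_labels = ["cell_A", "cell_A", "cell_B", "cell_B"]
class_scores = {
    "cell_A": [0.9, 0.6, 0.7, 0.1],
    "cell_B": [0.1, 0.4, 0.3, 0.9],
}
per_cell = one_vs_rest_auroc(class_scores, true_labels)
print(per_cell)
```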
Results
Zero-shot: DNALMs failed to separate sequences by cell type; a simple motif-counting baseline worked better than all DNALMs. UMAP showed no clear separation of cell types in DNALM embeddings.
Supervised:
Probed DNALMs: best accuracy 38% (GENA-LM); best per-cell-type AUROC 0.6 to 0.8
Fine-tuned DNALMs: 67% accuracy (Caduceus); best per-cell-type AUROC 0.87 to 0.94
Ab initio (ChromBPNet-like): 66.7% accuracy; per-cell-type AUROC 0.84 to 0.90
Task 4
Task objective
Evaluate if DNALMs can predict quantitative chromatin accessibility from DNA sequence
Method
Data: 2 kb genomic sequences labeled with DNase-seq signal from five ENCODE cell types
Zero-shot: N/A
Supervised: DNALMs fine-tuned end-to-end on the sequence-to-activity (S2A) regression task
Results
Zero-shot: not evaluated
Supervised: fine-tuned DNALMs showed strong performance, nearly matching the ab initio CNN baseline on regression and classification; probing underperformed.
Task 5
Task objective
Test how well the model can predict the effects of genetic variation on chromatin accessibility
Method
Data: 2 QTL studies that associate genetic variation with variation of chromatin accessibility from ATAC-seq or DNase-seq experiments across a large cohort of lymphoblastoid cell lines (LCLs) from individuals of African ancestry
Zero-shot: cosine distance, log-likelihood
Supervised: absolute difference in predicted accessibility between the two alleles
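The three variant scores listed above can be sketched as follows, assuming each variant is represented by a reference-allele and an alternate-allele sequence that have already been run through the model. The function names, the sign convention of the log-likelihood score, and all numbers are illustrative assumptions, not the benchmark's implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def log_likelihood_score(ll_ref, ll_alt):
    """Zero-shot score: change in sequence log-likelihood (alt minus ref, assumed convention)."""
    return ll_alt - ll_ref

def accessibility_score(pred_ref, pred_alt):
    """Supervised score: absolute difference in predicted accessibility."""
    return abs(pred_alt - pred_ref)

# Hypothetical model outputs for one variant
ref_embedding, alt_embedding = [1.0, 0.0], [0.0, 1.0]
emb_score = cosine_distance(ref_embedding, alt_embedding)
ll_score = log_likelihood_score(ll_ref=-100.0, ll_alt=-102.5)
acc_score = accessibility_score(pred_ref=3.0, pred_alt=2.5)
print(emb_score, ll_score, acc_score)
```

All three scores depend only on the contrast between the two alleles, which is what makes this a counterfactual task: the model is asked how a single-base change alters its output.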
Results
Zero-shot: NT achieved the best performance on both the Yoruba and African datasets
Supervised: fine-tuned sequence-to-activity models were better than their probed counterparts.
Insight: fine-tuned models underperformed the ab initio baseline ChromBPNet in variant effect prediction, which underscores the importance of including counterfactual tasks in evaluations alongside observational assessments
Variant scoring
Variant scoring: multi-species
Variant scoring: the largest context length
Conclusions
1. Via unsupervised training, DNALMs learn some biologically relevant representations.
2. Simpler ab initio supervised models match or exceed the performance of much larger, fine-tuned DNALMs.
3. DNALMs perform particularly poorly on the variant effect prediction task.
Next time
summary & sweets
See you next time!