DepMap & Celligner2
Jérémie Kalfon -
DepMap: pan-Cancer biomarker & target discovery
About me
CaImAn an open source tool for scalable calcium imaging data analysis | eLife
Hidden patterns of codon usage bias across kingdoms | Journal of The Royal Society Interface
At the Broad Institute
A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia | G&D
Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops | in review, available on request
AML group @DFCI: epigenomics and dependency landscape of leukemias.
→ Developed tools for large-scale epigenomics.
→ Found the role of KMT2A rearrangement in the epigenomic network of the TFs MYC / MAX / MEF2D / IRF8 / HOXA9, and their role in the relapse state.
We now have TF-targeting drugs like JQ1, but drug combinations are likely needed.
Main projects/group: Cancer Data Science & DepMap
Epigenomics and dependency of Leukemia
A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia
Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops
Overview of the AML project pipeline
The main computational efforts of the AML project
Built: CREME (ChIP replicate Merger), Cobinding Matrix Tools
Used: ABCmodel, DeepTools, BedTools, IGV, NFcore, Nextflow, MACS2, …
Pipeline used by others now at DFCI.
But also: Slam-seq pipeline, Diff ChIP Tools, super res. Microscopy 3D data analysis.
Learnt a lot:
Biomarkers
Targets
Mechanisms
Perturbations
Achilles: Genome-wide CRISPR knockouts in cell lines
Previously RNAi (seed effect).
Now CRISPR (CN effect) ← New: guide match.
CERES / Chronos to correct for these effects.
A dependency: a gene whose knockout kills the cell.
Identify gene perturbations with selective cancer-killing effects
Why the DepMap? Why is this important?
Find the unbiased list of all dependencies across all cancers.
Initially used selectivity as a main measure for finding targets.
Finding targets → Finding biomarkers
Biomarkers are essential.
Loss of VPS4A makes you dependent on VPS4B
Hard for our model to find, because it is an arm-level event. Requires crafting features.
CCLE: history, production and driving impact.
[Figure: omics pipeline on Terra. Inputs: WES/WGS bams, RNAseq bams (7000 bams; ~500 new lines, ~400 new lines); GTEx. Steps: GATK4 CNV (copy number); CGA mutation calling pipelines and omics mutation pipeline with annotations (mutations); STAR, RSEM (mRNA expression); STAR-Fusion (transcript fusions); each followed by postprocessing. Outputs: gene-level CN, CN segments, mutations, gene-level and transcript-level expression, filtered and unfiltered fusions.]
Model for finding mechanisms
WIP: Ashir Borah, Jérémie Kalfon
Mechanism through the non coding genome
WIP, Jérémie Kalfon, David Wu
Again: pattern-based pretrained ML models (Enformer, DeepSEA, Basenji, …)
1. Used SpliceAI to generate psQTLs.
2. Validate them using RNAseq.
3. Look at exon-usage → dependency correlation.
By subsetting the search space using putative QTLs, we found many relationships.
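A minimal sketch of the exon-usage → dependency correlation, assuming a plain Pearson correlation across cell lines. The function name `pearson` and the values below are made up for illustration; the actual statistic used in the project is not stated on the slide.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

# Toy values, one entry per cell line (not real data).
exon_usage = [0.10, 0.20, 0.50, 0.80, 0.90]   # inclusion of one exon
dependency = [-0.1, -0.2, -0.5, -0.9, -1.0]   # gene effect, more negative = more dependent
r = pearson(exon_usage, dependency)
```

In this toy case `r` is strongly negative: higher exon inclusion goes with stronger dependency.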
From Celligner1 to Celligner2
Celligner: Aligning models to tumors
Issues: Cell lines != Tumors. “Is my cell line a good proxy for disease X?”
Initially: cPCA (remove main batch effect: representing contamination) + MNN (align clusters: representing lineages)
Celligner v1: original pipeline
Inputs: TCGA+ dataset (TPM), DepMap dataset (TPM)
1. Clustering: cluster each dataset to identify genes that are differentially expressed between clusters.
2. Finding the top 2x500 DE genes.
3. cPCA to remove tumor-infiltrating cell signal → cPC-corrected TCGA+ data, cPC-corrected DepMap data.
4. MNN on top k genes.
5. Correct using MNN vectors and apply the Marioni correction → MNN-corrected TCGA+ data.
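The MNN step of this pipeline can be sketched as follows. This is a toy, brute-force version under stated assumptions: Euclidean distance, small inputs, and hypothetical helper names (`nearest`, `mutual_nearest_neighbors`, `correction_vectors`). It is not the Celligner implementation, which also smooths the correction vectors.

```python
import math

def nearest(points, queries, k):
    """For each query, the set of indices of its k nearest points (Euclidean)."""
    result = []
    for q in queries:
        order = sorted(range(len(points)), key=lambda i: math.dist(q, points[i]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_neighbors(a, b, k=1):
    """Pairs (i, j) where b[j] is among the k nearest of a[i] and vice versa."""
    a_to_b = nearest(b, a, k)  # for each a[i], its neighbors in b
    b_to_a = nearest(a, b, k)  # for each b[j], its neighbors in a
    return [(i, j) for i in range(len(a))
            for j in sorted(a_to_b[i]) if i in b_to_a[j]]

def correction_vectors(a, b, pairs):
    """Difference b[j] - a[i] for each MNN pair, used to shift a toward b."""
    return [[bj - ai for ai, bj in zip(a[i], b[j])] for i, j in pairs]

# Toy 2-D embeddings (not real data).
cell_lines = [[1.0, 1.0], [11.0, 11.0], [5.0, 6.0]]
tumors = [[0.0, 0.0], [10.0, 10.0]]
pairs = mutual_nearest_neighbors(cell_lines, tumors, k=1)
shifts = correction_vectors(cell_lines, tumors, pairs)
```

The third cell line has a nearest tumor, but that tumor prefers another line, so it forms no mutual pair.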
Original Celligner results
Celligner: Aligning models to tumors
Initially: cPCA + MNN
Now: VAE with specific features.
→ Add my own dataset, predict classes, make counterfactuals, add in different batches, explain predictions/corrections.
→ Next: GNN, self-supervised training, …
Celligner2
Mutations
For answering the Q: “How similar is my CL to a tumor?” (known cancer events) → Cellector.
Mutation patterns / cell state → MFmap
Use LoF/GoF matrices as additional inputs. Smooth them using a GNN.
Then: explain the expression patterns seen by using non-coding features.
MOFA: Matrix decomposition
Quite complex
totalVI: Deep generative model
Functions are neural networks that serve to do the variational inference
(finding an approximate function for this complex distribution).
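For reference, the quantity that variational inference maximizes in such models is usually written as the evidence lower bound (ELBO). The symbols are assumptions, not taken from the slide: $x$ is the observed data, $z$ the latent variable, $q_\phi$ the encoder network and $p_\theta$ the decoder network.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards reconstruction; the second keeps the approximate posterior close to the prior.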
Seurat: Nearest Neigh. based method
Quite complex. Same as Celligner but:
- A set of nearest neighbors is like a graph.
- Find a set of features that maximize variance across modalities.
- Find anchors (using CCA first and then MNN).
- Weight samples based on distance to anchor, and weight anchors based on how well they correlate across datasets / modalities.
Unbiased - Differentiable Model: VAE
MFmap → semi-supervision + DNAseq (using graphs)
trVAE → MMD in loss to mix datasets
Multigrate → combine different
Expimap → use gene sets to make an explainable latent space
scVI → you can do statistical analysis on the VAE output
Unbiased: adapts to any datatype we give it
Differentiable:
- We can add modules.
- Explainable: many tools already exist.
- Simple: an optimisation problem.
Initial model:
trVAE with MMD loss creates great batch correction for DepMap & TCGA
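A minimal sketch of the MMD term, assuming a single RBF kernel and the biased estimator. The names `rbf` and `mmd2` and the `gamma` value are illustrative; trVAE's own kernel choice and the layer it applies MMD to may differ.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Toy latent vectors for two batches (not real data).
batch_a = [[0.0, 0.0], [0.0, 1.0]]
batch_near = [[0.1, 0.0], [0.0, 1.1]]
batch_far = [[5.0, 5.0], [5.0, 6.0]]
```

Adding `mmd2` between the latent representations of two datasets to the training loss penalizes the model when the datasets occupy different regions, which is what pushes them to mix.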
Semi-supervision: adding a classifier improves all metrics
Using MFmap method to add a classification task on the latent space
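One way to read "a classification task on the latent space" is as an extra cross-entropy term in the VAE loss, applied only to labeled samples. The sketch below assumes that reading; the function names and the weights `beta` and `alpha` are hypothetical and are not MFmap's or Celligner2's actual loss.

```python
import math

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def semi_supervised_loss(x, x_hat, mu, logvar, logits, label=None,
                         beta=1.0, alpha=1.0):
    """Reconstruction + beta * KL, plus alpha * classification when a label exists."""
    loss = mse(x, x_hat) + beta * kl_diag_gaussian(mu, logvar)
    if label is not None:  # unlabeled samples skip the classifier term
        loss += alpha * cross_entropy(logits, label)
    return loss
```

Unlabeled samples still contribute through the reconstruction and KL terms, which is what makes the setup semi-supervised.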
The latent space preserves more data about samples
UMAP of the latent space after training on DepMap and TCGA data with semi-supervision (panels: dataset, lineage)
The latent space preserves more data about samples
Comparison between Celligner1’s representation and Celligner2’s
Having a useful representation
Celligner2’s latent space, taking some random axis and showing some classified labels
Celligner2 with no classification
Celligner2 outperforms Celligner1 on scIB
Performance comparison on 9 different scIB metrics between Celligner v1 and v2 with varying number of input datasets. Legend: v1 (5-D), v1 (2-D), v2 (5-D), v2 (3-D + unsup.)
Lineage classification ability on unseen DepMap
Classifying DepMap cell lines given a Celligner2 model trained only to classify TCGA.
Celligner2 can reconstruct the expression count
It can make counterfactuals: “What if this cell line was a tumor?”
→ Predicting lineage and disease information improves reconstruction and counterfactuals’ quality.
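The counterfactual idea can be illustrated with a deliberately simple conditional model: encode a sample with its true condition, then decode with a different one. Everything here is a toy assumption (additive per-condition offsets, the `OFFSETS` values, the function names); the real model uses neural network encoders and decoders.

```python
# Hypothetical per-condition expression offsets (toy numbers, not learned).
OFFSETS = {
    "cell_line": [1.0, 0.0, -0.5],
    "tumor": [0.0, 2.0, 0.5],
}

def encode(x, condition):
    """Toy encoder: remove the condition-specific offset from the expression."""
    return [xi - oi for xi, oi in zip(x, OFFSETS[condition])]

def decode(z, condition):
    """Toy decoder: add the offset of the requested condition back."""
    return [zi + oi for zi, oi in zip(z, OFFSETS[condition])]

def counterfactual(x, source, target):
    """'What if this sample came from `target` instead of `source`?'"""
    return decode(encode(x, source), target)

sample = [3.0, 1.0, 2.0]  # toy expression of one cell line
as_tumor = counterfactual(sample, "cell_line", "tumor")
```

Decoding with the original condition returns the sample itself, which is the reconstruction case.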
Many open questions still
Correlation between true and reconstructed gene counts
Differential expression analysis on reconstructed output (from scVI) → still WIP
Celligner can explain classification decisions using LRP
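A minimal sketch of layer-wise relevance propagation for one bias-free dense layer, assuming the epsilon rule. The function name `lrp_dense` and the toy numbers are illustrative; which LRP rule Celligner2 uses is not stated here.

```python
def lrp_dense(inputs, weights, relevance_out, eps=1e-9):
    """Redistribute output relevance onto the inputs of a linear layer (LRP-epsilon).

    weights[i][j] connects input i to output j.
    """
    n_in, n_out = len(inputs), len(relevance_out)
    # Pre-activations z_j = sum_i x_i * w_ij
    z = [sum(inputs[i] * weights[i][j] for i in range(n_in)) for j in range(n_out)]
    relevance_in = []
    for i in range(n_in):
        r = 0.0
        for j in range(n_out):
            # Small stabilizer with the sign of z_j avoids division by zero.
            denom = z[j] + (eps if z[j] >= 0 else -eps)
            r += inputs[i] * weights[i][j] / denom * relevance_out[j]
        relevance_in.append(r)
    return relevance_in

# Toy example: two input genes feeding one class score.
gene_relevance = lrp_dense([1.0, 2.0], [[3.0], [0.5]], [1.0])
```

Each input receives relevance in proportion to its contribution to the output, and the total relevance is approximately conserved from layer to layer.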
https://pubmed.ncbi.nlm.nih.gov/8794409/#:~:text=Cancer%2Dassociated%20retinopathy%20(CAR),autoimmune%20reactions%20directing%20retinal%20antigens, https://pubmed.ncbi.nlm.nih.gov/35464891/, https://pubmed.ncbi.nlm.nih.gov/28269748/
We can then apply GSEA to see if we find meaningful gene programs
E2F, MYC, IGSF21: all known targets / biomarkers of lung cancer.
The classifier focuses on lung / lung cancer gene programs to predict whether or not a DepMap sample is lung.
GSEA on relevant features predicting a DepMap lung cell line (with LRP)
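A sketch of the running-sum statistic behind GSEA, applied to genes ranked by relevance. This is the unweighted variant and omits permutation-based significance, so it is a simplification of what GSEA tools compute; the gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of the unweighted running-sum statistic."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene in the set, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Placeholder genes ranked from most to least relevant.
ranking = ["G1", "G2", "G3", "G4", "G5", "G6"]
top_score = enrichment_score(ranking, {"G1", "G2"})
bottom_score = enrichment_score(ranking, {"G5", "G6"})
```

A gene set concentrated at the top of the ranking gets a score near +1, one concentrated at the bottom a score near -1.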
GSEA: explaining classification decisions
In a model trained with GTEx data.
Explaining lung lineage prediction for CCLE cell lines only. We see a focus on other pathways that are more cell line specific
Adding more data helps make the model bigger
UMAP of CCLE/TCGA/GTEx without classification for CCLE. 42 latent dimensions, 3000 key genes, larger model
Current work
MMD is not efficient when adding multiple bias classes
Current work and some next steps
Use GCNNs to scale our analysis to genome wide
First layers convolve on genes known to interact (from PPI and other sources)
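A minimal sketch of one graph-convolution step over a gene interaction graph, assuming the common symmetric normalization with self-loops. The function name, the tiny adjacency matrix and the weights are illustrative only; the planned architecture is not specified on the slide.

```python
import math

def gcn_layer(adjacency, features, weights):
    """One step: ReLU(D^-1/2 (A + I) D^-1/2 X W), with genes as nodes."""
    n = len(adjacency)
    n_feat, n_hidden = len(weights), len(weights[0])
    # Add self-loops so each gene keeps its own signal.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate features over interacting genes.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(n_feat)] for i in range(n)]
    # Linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weights[f][h] for f in range(n_feat)))
             for h in range(n_hidden)] for i in range(n)]

# Toy graph: genes 0 and 1 interact, gene 2 is isolated.
ppi = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
expression = [[1.0], [3.0], [5.0]]  # one feature per gene
hidden = gcn_layer(ppi, expression, [[1.0]])
```

The two interacting genes end up sharing an averaged value, while the isolated gene is unchanged.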
Other Q: Can we then apply Celligner2 to dependency data?
Recap of the new version
New features
New usages
Celligner2 can work with new modalities
Simple tweak to add the “multigrate” framework.
Ok to work with missing features.
Will likely need a lot of training data and would mostly work with paired single cell type data.
With an unbiased model, any Matrix is a new feature
Questions
Finding targets
Why is this hard?
Cell line features
Machine learning model
Model interpretation
Model accuracy
Why is this a hard problem?
Omics
CCLE: history, production and driving impact.
CCLE2: Novartis + Broad Institute.
Issues: No one left by mid-2019.
Productionalization: DepMapOmics with DMC and quarterly releases.
Impact: Open source state of the art for cancer cell line omics analysis.
RNA, WES, WGS, proteo, methylation, …
Model for finding mechanisms
WIP, Jérémie Kalfon
Task: Improve our biomarker prediction.
Issue: RNA is great, mutations… not so much.
Idea: Change our framework of prediction
Linking Expression to Methylation status
Unpublished, WIP, David Wu, Beroukhim Lab
corr (predicted, actual expression)
using XGBoost
corr (expression vs coverage-weighted beta value)
SpliceAI: predicting splicing QTLs in silico
Large NN: ResNet with dilated convolutional layers
Up to 10k bp context
Trained on GTEx dataset
Used as is in our pipeline: we send it the reference genome with SNPs
→ Prediction output used to run t-test over exon inclusion of samples [with / without] mutations
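The with / without comparison can be sketched as a two-sample t-test on exon inclusion. The slide only says "t-test"; the Welch variant (unequal variances), the function name `welch_t` and the inclusion values below are assumptions for illustration.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error of each mean
    vb = variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Toy exon inclusion values (not real data).
with_mutation = [0.80, 0.90, 0.85]
without_mutation = [0.20, 0.30, 0.25, 0.15]
t_stat, dof = welch_t(with_mutation, without_mutation)
```

A large positive `t_stat` means samples carrying the predicted mutation include the exon more often; a p-value would come from the t distribution with `dof` degrees of freedom.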
Successes
Splicing & non-coding features interpretation
WIP, Jérémie Kalfon, David Wu
Fresh new results being investigated.
Intronic mutations predicted to splice in exon 4 of geneX make the cell line more dependent on geneY.
We can do a lot better.
Dependency present in leukemias.
More relationships to come.
Intronic mutation in X
Gene Y
WRN & MSI
Microsatellites: short (2-5 nucleotide) repeats. Microsatellite instability (MSI) results from an impaired DNA mismatch repair pathway.
MSI → WRN synth. dep.
De-risking targets
Going beyond: prioritizing targets
Approaches to further validate targets
In DepMap
Computational approaches:
“It will remain artisanal for a while…
We need to create tools for artisans”
Added complexity: drug polypharmacology
Drugs rarely produce similar effects to a gene knockout
→ Use Combination KOs
Interpretability through clusters
Recovering large biology events
[Figure: features (20q CN, 18q CN, VPS4B CN, VPS4B GE, CHMP4B CN, CHMP4B GE) × cell lines]
Interpretability through clusters
Recovering large biology events
VPS4A ~ ‘ESCRTIII complex regulation by VPS4’ + ‘HAUS complex’