1 of 69

DepMap & Celligner2

Jérémie Kalfon -

DepMap: pan-Cancer biomarker & target discovery

2 of 69

About me

3 of 69

About me

CaImAn an open source tool for scalable calcium imaging data analysis | eLife; Hidden patterns of codon usage bias across kingdoms | Journal of The Royal Society Interface

ECE: French engineering school (école d'ingénieurs), biomedical engineering.
Research in computational neuroscience at the Flatiron Institute.
M.Sc. in ML at the University of Kent.
Project on codon usage bias.
Start-up building a group messaging platform: PiPle.

4 of 69

At the Broad Institute

A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia | G&D; Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops | in review, available on request

AML group @DFCI: Epigenomics and dependency landscape of leukemias.

→ Developed tools for large-scale epigenomics.
→ Found the role of KMT2A rearrangement in the epigenomic network of the TFs MYC / MAX / MEF2D / IRF8 / HOXA9, and their role in the relapse state.

We now have TF-targeting drugs like JQ1, but drug combinations will likely be needed.

Main projects/group: Cancer Data Science & DepMap

5 of 69

Epigenomics and dependency of Leukemia

A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia

Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops

Needed one project to talk about. Won't go into the detail of this project, even though it is very interesting.

The first paper highlights the first results we got from this dataset: H3K27ac ChIP on hundreds of PDXs. It shows that a rearrangement explains a big change in TF networks in a subset of leukemias, mainly pediatric. It explains a change in dependency too.

This change is explained using the cobinding matrix from hundreds of TF ChIPs in one cell line.

Rearrangement could be a mechanism of inflammation and progression, changing the epigenetic architecture to access new transcriptional programs.

The second paper digs deeper into this cobinding matrix and the architecture of the CRC in MV411, using fast knockdowns with degron tags on TFs and short-time measurement of newly synthesized RNA with the SLAMseq method.

Found many things, but wanted to change the idea that the CRC is an all-to-all connection. It is more complex. This can only be seen when looking at direct binding (motifs, TFBS).

Incoherent feed-forward loop.

(3 min)

6 of 69

Epigenomics and dependency of Leukemia


7 of 69

Overview of the AML project pipeline

8 of 69

The main computational efforts of the AML project

Built: CREME (ChIP replicate Merger), Cobinding Matrix Tools

Used: ABCmodel, DeepTools, BedTools, IGV, NFcore, Nextflow, MACS2, …

Pipeline used by others now at DFCI.

But also: Slam-seq pipeline, Diff ChIP Tools, super res. Microscopy 3D data analysis.

Learnt a lot:

is ChIPseq really useful?
Slamseq’s power and issues.

9 of 69

Biomarkers

Targets

Mechanisms

10 of 69

Perturbations

11 of 69

Achilles: Genome-wide CRISPR knockouts in cell lines

Unpublished, Chronos: a cell population dynamics model of CRISPR experiments that improves inference of gene fitness effects | Genome Biology | Full Text

Previously RNAi (seed effect).
Now CRISPR (CN effect) ← New: guide match.

Ceres / Chronos to correct for these effects.

A dependency: a gene that kills the cell when knocked out.

Identify gene perturbations with selective cancer-killing effects

12 of 69

Why the DepMap? Why is this important?

Find the unbiased list of all dependencies across all cancers.

Find a selective dependency.
See the lineage
Find the right biomarkers
Design a drug

Initially used selectivity as a main measure for finding targets.

13 of 69

Finding targets → Finding biomarkers

Synthetic Lethal Interaction between the ESCRT Paralog Enzymes VPS4A and VPS4B in Cancers Harboring Loss of Chromosome 18q or 16q

Biomarkers are essential.

Loss of VPS4A makes you dependent on VPS4B

Hard for our model to find because it is an arm-level event. Requires crafting features.

14 of 69

CCLE: history, production and driving impact.

[Diagram: DepMap omics processing pipeline on Terra. Inputs: 7000 BAMs (WES/WGS BAMs and RNAseq BAMs; ~500 new lines and ~400 new lines). Tools: GATK4 CNV, CGA mutation calling pipelines (omics mutation pipeline), STAR, RSEM, STAR-Fusion. Post-processing and annotations. Outputs: copy number (CN segments, gene-level CN), mutations, mRNA expression (gene level, transcript level), transcript fusions (filtered and unfiltered fusions). Also labeled: GTEx.]

15 of 69

Model for finding mechanisms

WIP: Ashir Borah, Jeremie Kalfon

16 of 69

Mechanism through the non coding genome

WIP, Jeremie Kalfon, David Wu

Again: pattern-based pretrained ML models (Enformer, DeepSEA, Basenji, …)

1. Used SpliceAI to generate psQTLs.

2. Validate them using RNAseq.

3. Look at exon-usage -> dependency correlation.

By subsetting the search space using putative QTLs, found many relationships.

17 of 69

From Celligner1 to Celligner2

Why talk about that?

But many times the biggest impact comes from productionalizing something, spending time looking at the features that pop up, reprocessing data, and changing the framework of thought.

But also showing that over the years, starting with fairly little knowledge of genomics and oncology, I have been able to take responsibility for a big part of the DepMap project, and to gather a team around me to drive this large effort.

From this dataset, we have been able to reach a mechanistic understanding for some of these targets. I am very proud of having been part of that. There are dozens more targets found that I was not a part of. The data we generate is used by thousands of researchers and oncologists around the world.

Also lots of other projects dear to my heart.

18 of 69

Celligner: Aligning models to tumors

Issues: Cell lines != Tumors. “Is my cell line a good proxy for disease X?”

Initially: cPCA (removes the main batch effect, representing contamination) + MNN (aligns clusters, representing lineages)

19 of 69

Celligner v1

20 of 69

Original pipeline

TCGA+ dataset (TPM)

DepMap dataset (TPM)

21 of 69

Original pipeline

clustering

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Cluster each dataset to identify genes that are differentially expressed between clusters

cluster

22 of 69

Original pipeline

Finding top 2x500 genes

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Cluster each dataset to identify genes that are differentially expressed between clusters

cluster

DE genes

23 of 69

Original pipeline

cPCA to remove tumor infiltrating cell signal

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Run contrastive principal components analysis (cPCA) to identify patterns of correlated variation that are enriched in one dataset relative to another
To avoid biases resulting from the differential disease composition between the two datasets, we use the already calculated clusters to remove the average expression per cluster so that cPCA contrasts the intra-cluster covariance structure between the cell line and tumor data
Regress out the top (4) gene expression signatures w/ elevated variances across tumor samples compared to cell lines (mostly immune cell related) from both datasets
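
A minimal sketch of the three steps above (remove per-cluster means, find contrastive PCs, regress them out), assuming NumPy and synthetic data. The function names, the contrast weight `alpha`, and the toy matrices are illustrative assumptions, not the Celligner code.

```python
import numpy as np

def remove_cluster_means(X, labels):
    """Subtract each cluster's mean profile so only intra-cluster variation remains."""
    X = X.astype(float).copy()
    for c in np.unique(labels):
        mask = labels == c
        X[mask] -= X[mask].mean(axis=0)
    return X

def contrastive_pcs(target, background, alpha=1.0, n_components=4):
    """Top eigenvectors of cov(target) - alpha * cov(background)."""
    c_t = np.cov(target, rowvar=False)
    c_b = np.cov(background, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(c_t - alpha * c_b)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]  # shape: genes x n_components

def regress_out(X, components):
    """Project the given orthonormal directions out of X."""
    return X - X @ components @ components.T

# Toy data (hypothetical): tumors carry one extra high-variance signature on gene 0.
rng = np.random.default_rng(0)
n_genes = 20
signature = np.zeros(n_genes)
signature[0] = 1.0
tumors = rng.normal(size=(300, n_genes)) + rng.normal(scale=5, size=(300, 1)) * signature
lines = rng.normal(size=(200, n_genes))
tumor_clusters = rng.integers(0, 3, size=300)
line_clusters = rng.integers(0, 3, size=200)

tumors_c = remove_cluster_means(tumors, tumor_clusters)
lines_c = remove_cluster_means(lines, line_clusters)
cpcs = contrastive_pcs(tumors_c, lines_c, n_components=1)
tumors_corrected = regress_out(tumors_c, cpcs)
lines_corrected = regress_out(lines_c, cpcs)
```

In this toy case the single contrastive PC should align with the tumor-only signature; the slide's pipeline removes the top 4.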

cluster

DE genes

cPCA

24 of 69

Original pipeline

TCGA+ dataset (TPM)

DepMap dataset (TPM)


cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

25 of 69

Original pipeline

MNN on top k genes

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Run mutual nearest neighbors (MNN) batch correction to further align the datasets
Use differentially expressed genes to identify mutual nearest neighbors
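
A small sketch of how mutual nearest neighbor pairs can be found, assuming NumPy and brute-force distances. The function name, `k`, and the toy points are assumptions for illustration; the single global shift at the end is a simplification of the locally smoothed correction vectors used by the published MNN method.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=5):
    """Return (i, j) pairs where b[j] is among a[i]'s k nearest in b, and a[i] among b[j]'s k nearest in a."""
    # Pairwise squared Euclidean distances, shape (len(a), len(b)).
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_a_to_b = np.argsort(d, axis=1)[:, :k]    # for each a[i], its k nearest rows of b
    nn_b_to_a = np.argsort(d, axis=0)[:k, :].T  # for each b[j], its k nearest rows of a
    pairs = []
    for i in range(len(a)):
        for j in nn_a_to_b[i]:
            if i in nn_b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy example: b is a shifted copy of a, standing in for a batch effect.
a = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
b = a + np.array([1.0, 0.5])
pairs = mutual_nearest_neighbors(a, b, k=1)

# Simplified correction: average difference across MNN pairs.
shift = np.mean([b[j] - a[i] for i, j in pairs], axis=0)
b_corrected = b - shift
```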

cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

MNN

26 of 69

Original pipeline

Correct using MNN vectors and apply Marioni correction

TCGA+ dataset (TPM)

DepMap dataset (TPM)

cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

MNN

MNN corrected TCGA+ data

27 of 69

Original Celligner results

TCGA+ (TCGA, TARGET, Treehouse) dataset: 12,236 tumors
DepMap 19Q4 public cell lines: 1,249 cell lines
Portal: https://depmap.org/portal/celligner/

28 of 69

Celligner: Aligning models to tumors

Initially: cPCA + MNN

Now: VAE with specific features.

→ Add my own dataset, predict classes, make counterfactuals, add in different batches, explain predictions/corrections, …

→ Next: GNN, self-supervised training, …

29 of 69

Celligner2

30 of 69

Mutations

MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes | PLOS ONE

For answering the Q: “How similar is my CL to a tumor?” (Known cancer events) → Cellector.

Mutation patterns / cell state → MFmap

Use LoF/GoF matrices as additional inputs. Smooth them using a GNN.
Then: explain the expression patterns seen using non-coding features.

31 of 69

MOFA: Matrix decomposition

MOFA | Multi-Omics Factor Analysis

Quite complex

32 of 69

totalVI: Deep generative model

totalVI - scvi-tools

Functions are neural networks that perform the variational inference
(finding an approximate function for this complex distribution).

33 of 69

Seurat: Nearest Neigh. based method

Comprehensive Integration of Single-Cell Data

Quite complex. Same as Celligner but:

Select top K features from each modality
Weight the MNN graph across modalities

- A set of nearest neighbors is like a graph.
- Find a set of features that maximizes variance across modalities
- Find anchors (using CCA first and then MNN)
- Weight samples based on distance to anchor, and weight anchors based on how well they correlate across datasets / modalities

34 of 69

Unbiased - Differentiable Model: VAE

www.nxn.se/valent/2022/3/9/vaes-are-explainable-differential-expression-in-scvi

MFmap → semi-supervision + DNAseq (using graphs)
trVAE → MMD in loss to mix datasets
Multigrate → combine different
Expimap → use gene sets to make an explainable latent space
scVI → you can do statistical analysis on VAE output

Unbiased: adapts to any datatype we give it

Differentiable:
- We can add modules.
- Explainable: many tools already exist
- Simple: an optimisation problem.

35 of 69

Initial model:

trVAE with MMD loss creates great batch correction for DepMap & TCGA

Theis lab's trVAE

Maximum Mean Discrepancy (MMD): a measure of distance between two distributions, used here as a similarity loss. Non-parametric and kernel-based.

More:

It uses the maximum difference in means over kernel feature maps, so the “mean” captures more than the center of each distribution: it also captures its shape (higher-order moments).
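
To make the MMD idea concrete, here is a minimal NumPy sketch of a (biased) squared-MMD estimate with RBF kernels. The bandwidths in `gammas` and the toy samples are assumptions; this is not the trVAE implementation.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel matrix exp(-gamma * ||x_i - y_j||^2)."""
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d)

def mmd2(x, y, gammas=(0.01, 0.1, 1.0)):
    """Biased estimate of squared MMD, summed over several RBF bandwidths."""
    total = 0.0
    for g in gammas:
        total += (
            rbf_kernel(x, x, g).mean()
            + rbf_kernel(y, y, g).mean()
            - 2.0 * rbf_kernel(x, y, g).mean()
        )
    return total

rng = np.random.default_rng(0)
latent_a = rng.normal(size=(200, 5))             # e.g. latent codes of one dataset
latent_b_same = rng.normal(size=(200, 5))        # same distribution
latent_b_shifted = rng.normal(size=(200, 5)) + 2  # shifted distribution (batch effect)

mmd_same = mmd2(latent_a, latent_b_same)
mmd_shifted = mmd2(latent_a, latent_b_shifted)
```

Used as a loss term, minimizing this value pushes the latent codes of the two datasets toward the same distribution.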

KL loss: the “variational” part of the encoder. The true posterior P(z|x) is intractable, so the encoder approximates it with a normal distribution Q(z|x), and the KL term keeps Q(z|x) close to the prior P(z).
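
For reference, the standard VAE objective and the closed-form KL term, written under the usual assumptions of a diagonal Gaussian encoder $q(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$. The symbols are introduced here for illustration; the full Celligner2 loss adds further terms (MMD, classification) on top of this.

```latex
\mathcal{L}_{\mathrm{VAE}}(x)
  = -\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  + \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big),
\qquad
\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```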

Changed the reconstruction loss, some preprocessing, the model size, and a few other parameters to improve accuracy (already very good by default).

Used a negative binomial (NB) loss instead of a zero-inflated NB loss.

Needed better batch mixing (with a bias towards samples from smaller clusters, which contain less-represented lineages).

For the loss, the model learns a dispersion parameter (per gene, per batch) to decrease the effect on the loss of genes that are more overdispersed / harder to model.
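
A sketch of the NB reconstruction loss in the mean / inverse-dispersion parameterization, using only the standard library. The function names and the toy values are assumptions; in the model, `mu` would come from the decoder and `theta` would be the learned per-gene (per-batch) dispersion.

```python
import math

def nb_log_likelihood(x, mu, theta):
    """Log-probability of count x under an NB with mean mu and inverse dispersion theta.

    Variance is mu + mu**2 / theta, so a small theta means a more overdispersed gene.
    """
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def nb_reconstruction_loss(counts, means, thetas):
    """Negative log-likelihood summed over genes (one theta per gene)."""
    return -sum(
        nb_log_likelihood(x, m, t) for x, m, t in zip(counts, means, thetas)
    )

# Toy example: a reconstruction close to the observed counts has a lower loss.
observed = [3, 10]
good_loss = nb_reconstruction_loss(observed, [3.0, 10.0], [2.0, 2.0])
bad_loss = nb_reconstruction_loss(observed, [30.0, 1.0], [2.0, 2.0])
```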

36 of 69

Semi-supervision: Adding a classifier improves all metrics

Using MFmap method to add a classification task on the latent space

Only limited impact on overall loss
Creates a more explainable latent space
The model focuses on what the user deems most important for their needs (lineage, disease type, genetic feature e.g. MSI, …)

37 of 69

The latent space preserves more data about samples

Umap of the latent space after training on DepMap and TCGA data with semi supervision

dataset

lineage

38 of 69

The latent space preserves more data about samples

Comparison between Celligner1’s representation and Celligner2’s

39 of 69

Having a useful representation

Celligner2’s latent space, taking some random axis and showing some classified labels

40 of 69

Celligner2 with no classification

41 of 69

Celligner2 outperforms Celligner1 on scIB

Performance comparison on 9 different scIB metrics between celligner v1 and v2 with varying number of input datasets

v1 (5-D)

v1 (2-D)

v2 (5-D)

v2 (3-D + unsup.)

42 of 69

Lineage classification ability on unseen DepMap

Classifying DepMap cell lines given a Celligner2 model trained only to classify TCGA. →

On par with MFmap’s results.
Mistakes are often of similar lineage (e.g. Liposarcoma / sarcoma)
Accuracy above .94 on TCGA.

43 of 69

Celligner2 can reconstruct the expression count

[scvi-tools] models-totalvi: counterfactual

It can make counterfactuals: “What if this cell line was a tumor?”

→ Predicting lineage and disease information improves reconstruction and counterfactuals’ quality.

Many open questions still

Correlation between true and reconstructed gene counts

44 of 69

diff. expr. Analysis on reconstructed output

Can take two samples.
Sample from their posteriors.
Generate a distribution for each gene.
Then generate a volcano plot.
Can do the same at the population level.
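
The steps above can be sketched as follows, assuming we already have posterior expression draws for two samples. The Poisson draws stand in for decoder samples, and the function name, threshold and pseudocount are assumptions, not the scvi-tools API.

```python
import numpy as np

def posterior_de(samples_a, samples_b, lfc_threshold=1.0, pseudocount=1.0):
    """Per-gene effect size and DE probability from posterior draws.

    samples_a, samples_b: arrays of shape (n_draws, n_genes).
    Returns the median log2 fold change and the fraction of draws
    whose absolute log2 fold change exceeds the threshold.
    """
    lfc = np.log2(samples_a + pseudocount) - np.log2(samples_b + pseudocount)
    median_lfc = np.median(lfc, axis=0)
    prob_de = (np.abs(lfc) > lfc_threshold).mean(axis=0)
    return median_lfc, prob_de

# Toy posterior draws: gene 0 unchanged, gene 1 strongly up in sample A.
rng = np.random.default_rng(0)
draws_a = rng.poisson(lam=[100, 400], size=(500, 2))
draws_b = rng.poisson(lam=[100, 50], size=(500, 2))

median_lfc, prob_de = posterior_de(draws_a, draws_b)
# Volcano plot coordinates: x = median_lfc, y = prob_de (one point per gene).
```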

(From scVI) → Still WIP

45 of 69

Celligner can explain classification decisions using LRP

https://pubmed.ncbi.nlm.nih.gov/8794409/#:~:text=Cancer%2Dassociated%20retinopathy%20(CAR),autoimmune%20reactions%20directing%20retinal%20antigens, https://pubmed.ncbi.nlm.nih.gov/35464891/, https://pubmed.ncbi.nlm.nih.gov/28269748/

We can then apply GSEA to see if we find meaningful gene programs

E2F, MYC, IGSF21. All known targets / biomarkers of lung cancer.

The classifier focuses on lung / lung cancer gene programs to predict whether or not a DepMap sample is lung

GSEA on relevant features predicting a DepMap lung cell line (with LRP)

Looking at lung. It has always been very close to the EMT cluster, and the idea is that the classifier could focus on pathways other than lung pathways.

explain XAI tools.

shapley

Integrated Gradients: approximates the integral of the gradients of the model's output with respect to the inputs (Riemann sum: summing over sampled points).
LRP (layer-wise relevance propagation): a backward propagation mechanism, with the backward rule defined per layer.
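
A minimal sketch of the Riemann-sum approximation behind Integrated Gradients, on a toy one-unit model with an analytic gradient. The model, weights, baseline and step count are assumptions chosen for illustration; a real classifier would supply gradients through autodiff.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, n_features)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

# Toy model: f(x) = tanh(w . x), with its analytic gradient.
w = np.array([1.0, -2.0, 0.5])

def model(x):
    return np.tanh(w @ x)

def model_grad(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([0.5, 0.1, 1.0])
baseline = np.zeros(3)
attributions = integrated_gradients(model_grad, x, baseline)
```

A useful sanity check is completeness: the attributions should sum to approximately `model(x) - model(baseline)`.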

Explain the additional explainer class that makes plots / explanations, and the other integrated toolkits.

cMYC is a hallmark in some cancers, including lung. Microglial embryonic state is a highly proliferative state. ← primed by TCGA, and regular lung primed by GTEx.

1. Lung / bronchiolar / olfactory / esophagus … up to stomach (likely related to lung cancer of various origins)

2. NK T cells are well known in cancer. IGSF21 are typical lung cancer responses.

3. Microglial embryonic state is a highly proliferative state. ← primed by CCLE
BCHE is expressed in gastric / esophageal tissues

4. E2F is well known in lung cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5990399/

Something I have started to do is to show that we can understand differences between cell lines and tumors for specific models, using these tools.

46 of 69

GSEA: explaining classification decisions

https://pubmed.ncbi.nlm.nih.gov/8794409/#:~:text=Cancer%2Dassociated%20retinopathy%20(CAR),autoimmune%20reactions%20directing%20retinal%20antigens, https://pubmed.ncbi.nlm.nih.gov/35464891/,

In a model trained with GTEx data.

Explaining lung lineage prediction for CCLE cell lines only. We see a focus on other pathways that are more cell line specific

We see things that at first do not seem to make sense. But these are the activation patterns it creates. It did not learn nonsense (as shown in the previous plot); these are just the things it focuses on to predict disease type / reconstruct / batch correct.

So what does it mean?

Fetal → stem cell, progenitor cell. ← pathways used by cancer

Midbrain neurotypes, ciliary body.
More recently (2021): “In the airways, basal cells can give rise to ciliated, neuroendocrine and club cells [2-5]. Basal cells are progenitor cells closely associated with the basal lamina that maintain airway homeostasis. Basal cells are considered the candidate cell of origin in lung SqCC” https://onlinelibrary.wiley.com/doi/full/10.1111/joim.13201

→ Ciliary body in the eye (iris)

E2F3 is a known hallmark of lung cancer, and E2F members are new targets.
https://pubmed.ncbi.nlm.nih.gov/28269748/

Origin of most pulmonary cancer is unknown. One hypothesis is that lung cancer originates mostly from the pulmonary plexus - takes on a more proliferative state this way and

47 of 69

Adding more data helps make the model bigger

UMAP of CCLE/TCGA/GTEx without classification for CCLE. 42 latent dimensions, 3000 key genes, larger model

48 of 69

Current work

MMD is not efficient when adding multiple bias classes

Combinatorial explosion when computing MMD

Works with continuous features as well (like removing the purity component based on percent purity)

Still WIP. Not yet sure whether this allows any improvement.

49 of 69

Current work and some next steps

Use GCNNs to scale our analysis genome-wide

First layers convolve on genes known to interact (from PPI and other sources)

Allows deeper models with skip connections.
Can predict new relationships.

Other Q: Can we then apply Celligner2 to dependency data?

graphNN is a way to convolve on similar genes and reduce the model size drastically by not having all to all connections. We could make it deeper this way. Using basic DNN tools. Might help us add other features.
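
A sketch of what one such graph-convolution layer over genes could look like, assuming NumPy, a binary gene-gene adjacency matrix (for example from PPI), and the common symmetric normalization from Kipf & Welling style GCNs. The function names and the toy graph are assumptions, not the planned Celligner2 architecture.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with added self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gene_graph_layer(expr, adj_norm, weight):
    """One graph convolution over genes.

    expr: (n_samples, n_genes, n_in) features per gene.
    weight: (n_in, n_out), shared across all genes, so the parameter count
    does not grow with the number of genes (unlike an all-to-all dense layer).
    """
    mixed = np.einsum("gh,shf->sgf", adj_norm, expr)  # mix each gene with its neighbours
    return np.maximum(mixed @ weight, 0.0)            # ReLU

# Toy graph: genes 0-1 interact, genes 2-3 interact.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
adj_norm = normalized_adjacency(adj)

expr = np.zeros((1, 4, 1))
expr[0, 0, 0] = 1.0            # signal only on gene 0
weight = np.array([[1.0]])
out = gene_graph_layer(expr, adj_norm, weight)
```

The signal on gene 0 spreads only to its interaction partner (gene 1), which is the sparsity that keeps the model small.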

A final goal is to add another dataset: our dependencies → to see if we can make sensible counterfactual dependency predictions for tumors. Can the model guess dependencies that are not transferable to tumors, or that exist solely in tumors?

State of things: we have the outline of the paper, with some remaining experiments to run. But every time we have more questions, and we feel we need to add more features to make it truly useful.

There was also an administrative problem in going further, which now makes it quite difficult to continue. Will talk more about this during questions.

50 of 69

Recap of the new version

able to work with many datasets
performs better correction when large batch effects exist (e.g. between cancer cell lines and frozen tumor tissues)
counterfactual predictions of gene count outputs: “show this cell line as if it was a tumor”
explainability using Explainable AI tools like LRP with GSEA / differential expression
QC methods: getting at quality (using scIB). Various interactive UMAP plots.
semi-supervision to classify cell type and any other feature provided.
added scArches' model surgery.
improved model size by using more input genes (3000 instead of <1000 previously).
reproduces all results from Celligner1 (+ faster to train)

a model that can be tuned
adding your own data: tumors, PDX, cell lines, 3D, …
figuring out the expected expression of genes if the model was a tumor
also works for scRNAseq
easily extendable to new modalities (NN's compositionality)

New features

New usages

51 of 69

Celligner2 can work with new modalities

Multigrate: single-cell multi-omic data integration | bioRxiv

Simple tweak to add the “multigrate” framework.

Ok to work with missing features.

Will likely need a lot of training data and would mostly work with paired single cell type data.

With an unbiased model, any Matrix is a new feature

52 of 69

Questions

I will now open the floor to questions.

Are you doing the Broad's Topcoder challenge? Predict tumor-infiltrating T cell states from specific gene knockouts (for immunotherapy purposes)
Can I contact a team member this week to talk a bit about the day-to-day and the atmosphere?

→ My boss was very into this, and it was agreed between us that I would publish on it. But he left for industry, and Bill Sellers, now head of the Cancer Program, is pushing hard to stop projects like this one and refocus on large-scale analysis of new data. Bill (who developed imatinib) was not a fan of the project. There is pressure on Paquita (who created DepMap) to finish this project and focus on things more in the style of the Cancer Program. They really want me to stay and have put me on a very cool project instead: producing Perturb-seq with SHARE-seq on all our cell lines with Jason (Buenrostro, who invented ATAC-seq and SHARE-seq).
This is also one of the many reasons for my departure. I do not want to waste time, and I want to move toward more ML-focused fields.

53 of 69

Finding targets

54 of 69

Why is this hard?

Cell line features

Machine learning model

Model interpretation

Model accuracy

Why is this a hard problem?

Many more features than samples, few sensitive lines
Potentially complex (i.e. nonlinear, multi-factor) relationships
Features are highly intercorrelated

55 of 69

Omics

56 of 69

CCLE: history, production and driving impact.

Next-generation characterization of the Cancer Cell Line Encyclopedia | Nature

CCLE2: Novartis + Broad Institute.

Issues: No one left by mid-2019.

Productionalization: DepMapOmics with DMC and quarterly releases.

Impact: Open source state of the art for cancer cell line omics analysis.

RNA, WES, WGS, proteo, methylation, …

57 of 69

CCLE: history, production and driving impact.


58 of 69

CCLE: history, production and driving impact.

59 of 69

Model for finding mechanisms

WIP, Jérémie Kalfon

Task: Improve our biomarker prediction.

Issue: RNA is great, mutations… not so much.

Idea: Change our framework of prediction

60 of 69

Linking Expression to Methylation status

Unpublished, WIP, David Wu, Beroukhim Lab

corr (predicted, actual expression)

using XGBoost

corr (expression vs coverage-weighted beta value)

61 of 69

SpliceAI: predicting splicing QTLs in silico

Predicting Splicing from Primary Sequence with Deep Learning: Cell

Large NN: ResNet with dilated convolutional layers

Up to 10k bp context

Trained on GTEx dataset

Used as is in our pipeline, sending it the reference genome with SNPs

→ Prediction output used to run t-test over exon inclusion of samples [with / without] mutations
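
A sketch of the per-exon test described above, assuming exon inclusion is summarized as one value per sample (e.g. percent spliced in) and using Welch's unequal-variance t statistic. The slide does not say which t-test variant is used, and the function name and values below are made up for illustration.

```python
import numpy as np

def welch_t(with_mut, without_mut):
    """Welch's t statistic and degrees of freedom for two groups of samples."""
    a = np.asarray(with_mut, dtype=float)
    b = np.asarray(without_mut, dtype=float)
    va = a.var(ddof=1) / len(a)  # squared standard error of each group mean
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical exon inclusion values for one exon.
psi_with_mutation = [0.80, 0.85, 0.90, 0.75]
psi_without_mutation = [0.20, 0.30, 0.25, 0.35, 0.30]

t_stat, dof = welch_t(psi_with_mutation, psi_without_mutation)
# The p-value follows from the t distribution with `dof` degrees of freedom
# (any statistics library can provide it).
```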

62 of 69

Successes

63 of 69

Splicing & non-coding features interpretation

WIP, Jérémie Kalfon, David Wu,

Fresh new results being investigated.

Intronic mutations predicted to splice in exon 4 of geneX make the cell line more dependent on geneY.

We can do a lot better.

Dependency present in leukemias.
More relationships to come.

Intronic mutation in X

Gene Y

64 of 69

WRN & MSI

Repeat expansions confer WRN dependence in microsatellite-unstable cancers

Microsatellite instability: Short (2-5) nucleotide repeats. Results from impaired DNA mismatch repair pathway.

TA-repeats form DNA secondary structures.

Stall replication forks.

Require unwinding by the WRN helicase.

MSI → WRN synth. dep.

65 of 69

De-risking targets

66 of 69

Going beyond: prioritizing targets

Approaches to further validate targets

In DepMap

Druggable CansarDB
in vivo screens (PDXs)
PRISM

Computational approaches:

Alphafold2: predict Prot. 3D struct.
MCTS: planning chemical synthesis.
GFlowNets: for chemical design.
drug toxicity predictions, Planning experiments for large combinatorial problems.

“It will remain artisanal for a while…

We need to create tools for artisans”

67 of 69

Added complexity: drug polypharmacology

Drugs rarely produce similar effects to a gene knockout

→ Use Combination KOs

68 of 69

Interpretability through clusters

Recovering large biology events

Inject prior knowledge

Find patterns among associated features

Build mechanistic hypotheses

20q CN

18q CN

VPS4B CN

VPS4B GE

CHMP4B CN

CHMP4B GE

Features

Cell Lines

69 of 69

Interpretability through clusters

Recovering large biology events

VPS4A ~ ‘ESCRTIII complex regulation by VPS4’ + ‘HAUS complex’