1 of 69

DepMap & Celligner2

Jérémie Kalfon

DepMap: pan-Cancer biomarker & target discovery

2 of 69

About me

3 of 69

About me

  • ECE (French école d’ingénieurs): biomedical engineering.
  • Research in Comp. Neuroscience at the Flatiron Institute
  • M.Sc. in ML at the University of Kent.
  • Project in Codon Usage Bias
  • Start Up on group messaging platform: PiPle.

4 of 69

At the Broad Institute

A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia | G&D
Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops | in review, available on request

AML group @DFCI: Epigenomics and dependency landscape of leukemias.
→ Developed tools for large-scale epigenomics.
→ Found the role of KMT2A rearrangement in the epigenomic network of the TFs MYC / MAX / MEF2D / IRF8 / HOXA9, and their role in the relapse state.

We now have TF-targeting drugs like JQ1, but drug combinations are likely needed.

Main projects/group: Cancer Data Science & DepMap

5 of 69

Epigenomics and dependency of Leukemia

A distinct core regulatory module enforces oncogene expression in KMT2A-rearranged leukemia

Leukemia core transcriptional circuitry is a sparsely interconnected hierarchy stabilized by incoherent feed-forward loops

6 of 69

Epigenomics and dependency of Leukemia

7 of 69

Overview of the AML project pipeline

8 of 69

The main computational efforts of the AML project

Built: CREME (ChIP replicate Merger), Cobinding Matrix Tools

Used: ABCmodel, DeepTools, BedTools, IGV, NFcore, Nextflow, MACS2, …

The pipeline is now used by others at DFCI.

But also: SLAM-seq pipeline, Diff ChIP Tools, super-resolution microscopy 3D data analysis.

Learnt a lot:

  • Is ChIP-seq really useful?
  • SLAM-seq’s power and issues.

9 of 69

Biomarkers
Targets

Mechanisms

10 of 69

Perturbations

11 of 69

Achilles: Genome-wide CRISPR knockouts in cell lines

Previously RNAi (seed effect).
Now CRISPR (CN effect) ← New: guide match.

CERES / Chronos to correct for these effects.

A dependency: a gene whose knockout kills the cell.

Identify gene perturbations with selective cancer-killing effects

12 of 69

Why the DepMap? Why is this important?

Find the unbiased list of all dependencies across all cancers.

  1. Find a selective dependency.
  2. See the lineage
  3. Find the right biomarkers
  4. Design a drug

Initially used selectivity as a main measure for finding targets.

13 of 69

Finding targets → Finding biomarkers

Biomarkers are essential.

Loss of VPS4A makes you dependent on VPS4B.

Hard for our model to find because it is an arm-level event. → Crafting features.

14 of 69

CCLE: history, production and driving impact.

[Pipeline diagram, run on Terra]
Inputs: WES/WGS bams, RNAseq bams; GTEx
Tools: GATK4 CNV, CGA mutation calling pipelines, STAR-Fusion, STAR, RSEM; each branch followed by a postprocess step
Outputs:
Copy number: CN segments, gene-level CN
Mutations (Omics mutation pipeline, annotations)
mRNA expression: gene level, transcript level
Transcript fusions: filtered fusions, unfiltered fusions
Scale labels: 7000 bams, ~500 new lines, ~400 new lines

15 of 69

Model for finding mechanisms

WIP: Ashir Borah, Jérémie Kalfon

16 of 69

Mechanism through the non coding genome

WIP, Jérémie Kalfon, David Wu

Again: pattern-based pretrained ML models (Enformer, DeepSEA, Basenji, …)

1. Used SpliceAI to generate psQTLs.

2. Validate them using RNAseq.

3. Look at exon-usage → dependency correlation.

By subsetting the search space using putative QTLs, found many relationships.

17 of 69

From Celligner1 to Celligner2

18 of 69

Celligner: Aligning models to tumors

Issues: Cell lines != Tumors. Is my cell line a good proxy for disease X?

Initially: cPCA (removes the main batch effect, representing contamination) + MNN (aligns clusters, representing lineages)

19 of 69

Celligner v1

20 of 69

Original pipeline

TCGA+ dataset (TPM)

DepMap dataset (TPM)

21 of 69

Original pipeline

clustering

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Cluster each dataset to identify genes that are differentially expressed between clusters

cluster

22 of 69

Original pipeline

Finding top 2x500 genes

TCGA+ dataset (TPM)

DepMap dataset (TPM)

Cluster each dataset to identify genes that are differentially expressed between clusters

cluster

DE genes

23 of 69

Original pipeline

cPCA to remove tumor infiltrating cell signal

TCGA+ dataset (TPM)

DepMap dataset (TPM)

  • Run contrastive principal component analysis (cPCA) to identify patterns of correlated variation that are enriched in one dataset relative to the other
  • To avoid biases resulting from the differing disease composition of the two datasets, use the already calculated clusters to remove the average expression per cluster, so that cPCA contrasts the intra-cluster covariance structure of the cell line and tumor data
  • Regress out, from both datasets, the top 4 gene expression signatures with elevated variance across tumor samples compared to cell lines (mostly immune-cell related)

cluster

DE genes

cPCA
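A minimal sketch of the cPCA step described above, assuming plain NumPy arrays with samples as rows and genes as columns. The function names, the `alpha` contrast weight, and the toy data are hypothetical; for brevity the sketch centers each dataset globally rather than removing per-cluster means as the original pipeline does.

```python
import numpy as np

def contrastive_pcs(target, background, alpha=1.0, n_components=4):
    """Top eigenvectors of cov(target) - alpha * cov(background).

    `alpha` (contrast strength) is a hypothetical tuning parameter.
    """
    contrast = np.cov(target, rowvar=False) - alpha * np.cov(background, rowvar=False)
    # The contrast matrix is symmetric; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

def regress_out(data, components):
    """Remove each sample's projection onto the (orthonormal) contrastive directions."""
    centered = data - data.mean(axis=0)
    return centered - centered @ components @ components.T

# Toy data: "tumors" carry extra variance along gene 0 (standing in for an
# infiltrating-cell signal) that the "cell lines" lack.
rng = np.random.default_rng(0)
cell_lines = rng.normal(size=(200, 5))
tumors = rng.normal(size=(300, 5))
tumors[:, 0] += 5.0 * rng.normal(size=300)

cpcs = contrastive_pcs(tumors, cell_lines, n_components=1)
tumors_corrected = regress_out(tumors, cpcs)
cell_lines_corrected = regress_out(cell_lines, cpcs)
```

On this toy data the first contrastive component points almost entirely along gene 0, and both corrected matrices have no remaining variance in that direction.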

24 of 69

Original pipeline

TCGA+ dataset (TPM)

DepMap dataset (TPM)

cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

25 of 69

Original pipeline

MNN on top k genes

TCGA+ dataset (TPM)

DepMap dataset (TPM)

  • Run mutual nearest neighbors (MNN) batch correction to further align the datasets
  • Use differentially expressed genes to identify mutual nearest neighbors

cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

MNN
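A minimal sketch of how mutual nearest neighbor pairs can be found between two datasets, assuming NumPy arrays restricted to the differentially expressed genes. The function names and toy data are hypothetical, and the correction shown is a single global mean offset; the published MNN method smooths the correction vectors locally instead.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=5):
    """Return (i, j) pairs where b[j] is among the k nearest rows of b to a[i], and vice versa."""
    # Pairwise squared Euclidean distances, shape (len(a), len(b)).
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each row of a: its k nearest rows of b
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each row of b: its k nearest rows of a
    pairs = []
    for i in range(len(a)):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs

def mean_correction_vector(a, b, pairs):
    """Average offset b[j] - a[i] over the MNN pairs (a crude, global correction)."""
    return np.mean([b[j] - a[i] for i, j in pairs], axis=0)

# Toy data: well-separated points and a copy shifted by a constant "batch effect".
a = 10.0 * np.arange(10)[:, None] * np.ones((1, 2))
shift = np.array([1.0, -1.0])
b = a + shift

pairs = mutual_nearest_neighbors(a, b, k=1)
correction = mean_correction_vector(a, b, pairs)
a_aligned = a + correction
```

Here every point pairs with its own shifted copy, so the averaged correction recovers the shift exactly.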

26 of 69

Original pipeline

Correct using MNN vectors and apply Marioni correction

TCGA+ dataset (TPM)

DepMap dataset (TPM)

cluster

DE genes

cPCA

cPC corrected TCGA+ data

cPC corrected DepMap data

MNN

MNN corrected TCGA+ data

27 of 69

Original Celligner results

  • TCGA+ (TCGA, TARGET, Treehouse) dataset: 12,236 tumors
  • DepMap 19Q4 public cell lines: 1,249 cell lines
  • Portal: https://depmap.org/portal/celligner/

28 of 69

Celligner: Aligning models to tumors

Initially: cPCA + MNN

Now: VAE with specific features.

→ Add my own dataset, predict classes, make counterfactuals, add in different batches, explain predictions/corrections, …

→ Next: GNN, self-supervised training, …

29 of 69

Celligner2

30 of 69

Mutations

For answering the Q: “How similar is my CL to a tumor?” (known cancer events) → Cellector.

Mutation patterns / cell state → MFmap

Use LoF/GoF matrices as additional inputs. Smooth them using a GNN.
Then: explain the expression patterns seen by using non-coding features.

31 of 69

MOFA: Matrix decomposition

Quite complex

32 of 69

totalVI: Deep generative model

Functions are neural networks that perform the variational inference
(finding an approximate function for this complex distribution).

33 of 69

Seurat: Nearest Neigh. based method

Quite complex. Same as Celligner but:

  1. Select top K features from each modality
  2. Weight the MNN graph across modalities

- A set of nearest neighbors is like a graph.
- Find a set of features that maximizes variance across modalities.
- Find anchors (using CCA first and then MNN).
- Weight samples based on distance to anchor, and weight anchors based on how well they correlate across datasets / modalities.

34 of 69

Unbiased - Differentiable Model: VAE

MFmap → semi supervision + DNAseq (using graphs)
trVAE → MMD in loss to mix datasets
Multigrate → combine different
Expimap → use gene sets to make an explainable latent space
scVI → you can do statistical analysis on VAE output

Unbiased: adapts to any datatype we give it

Differentiable:
- We can add modules.
- Explainable: many tools already exist.
- Simple: an optimisation problem.

35 of 69

Initial model:

trVAE with MMD loss creates great batch correction for DepMap & TCGA
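To make the MMD term concrete, here is a minimal sketch of a (biased) squared maximum mean discrepancy estimate between two batches, assuming a single RBF kernel. The kernel choice, the `gamma` bandwidth, and the toy data are assumptions for illustration; they are not taken from the trVAE implementation.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel matrix between rows of x and rows of y."""
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d)

def mmd2(x, y, gamma=0.5):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Toy latent codes: two batches from the same distribution, and one shifted batch.
rng = np.random.default_rng(0)
batch_a = rng.normal(size=(200, 2))
batch_same = rng.normal(size=(200, 2))
batch_shifted = rng.normal(size=(200, 2)) + 3.0

mmd_same = mmd2(batch_a, batch_same)
mmd_shifted = mmd2(batch_a, batch_shifted)
```

Adding such a term to the training loss penalizes latent representations in which batches (for example DepMap and TCGA) remain distinguishable, which is what pushes the datasets to mix.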

36 of 69

Semi supervision: adding a classifier improves all metrics

Using MFmap method to add a classification task on the latent space

  1. Only limited impact on overall loss
  2. Creates a more explainable latent space
  3. The model focuses on what the user deems most important for their needs (lineage, disease type, genetic feature e.g. MSI, …)
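One generic way to write such a semi-supervised objective is sketched below: a VAE loss (reconstruction + KL) plus a cross-entropy term computed only on labeled samples. This is an illustration under stated assumptions, not the MFmap or Celligner2 code; the weights `beta` and `gamma`, the squared-error reconstruction term, and the use of `-1` for unlabeled samples are all hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)

def cross_entropy(logits, labels):
    """Softmax cross-entropy, one value per sample."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def semi_supervised_loss(x, x_hat, mu, logvar, logits, labels, beta=1.0, gamma=1.0):
    recon = np.sum((x - x_hat) ** 2, axis=1)
    kl = kl_to_standard_normal(mu, logvar)
    labeled = labels >= 0          # label -1 marks an unlabeled sample
    ce = np.zeros(len(x))
    ce[labeled] = cross_entropy(logits[labeled], labels[labeled])
    return float(np.mean(recon + beta * kl + gamma * ce))

# Toy check: perfect reconstruction, prior-matching posterior, uniform logits,
# one unlabeled and one labeled sample out of 4 classes.
x = np.zeros((2, 3))
loss = semi_supervised_loss(
    x, x_hat=x, mu=np.zeros((2, 2)), logvar=np.zeros((2, 2)),
    logits=np.zeros((2, 4)), labels=np.array([-1, 2]),
)
```

In this toy case only the classification term contributes, so the loss equals log(4) / 2.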

37 of 69

The latent space preserves more data about samples

Umap of the latent space after training on DepMap and TCGA data with semi supervision

dataset

lineage

38 of 69

The latent space preserves more data about samples

Comparison between Celligner1’s representation and Celligner2’s

39 of 69

Having a useful representation

Celligner2’s latent space, taking some random axis and showing some classified labels

40 of 69

Celligner2 with no classification

41 of 69

Celligner2 outperforms Celligner1 on scIB

Performance comparison on 9 different scIB metrics between Celligner v1 and v2 with varying number of input datasets

v1 (5-D)

v1 (2-D)

v2 (5-D)

v2 (3-D + unsup.)

42 of 69

Lineage classification ability on unseen DepMap

Classifying DepMap cell lines given a Celligner2 model trained only to classify TCGA:

  • On par with MFmap’s results.
  • Mistakes are often of similar lineage (e.g. Liposarcoma / sarcoma)
  • Accuracy above .94 on TCGA.

43 of 69

Celligner2 can reconstruct the expression count

It can make counterfactuals: “What if this cell line was a tumor?”

→ Predicting lineage and disease information improves reconstruction and counterfactuals’ quality.

Many open questions still

Correlation between true and reconstructed gene counts

44 of 69

Differential expression analysis on reconstructed output

  • Can take two samples.
  • Sample from their posteriors.
  • Generate a distribution for each gene.
  • Then generate a volcano plot.
  • Can do the same at the population level.

(From scVI) → Still WIP
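The steps above can be sketched as follows, assuming Gaussian posteriors over the latent space and a decoder that returns positive expression values. The function name, the `delta` effect-size threshold, and the toy linear decoder are hypothetical; scVI's actual procedure differs in its details.

```python
import numpy as np

def posterior_lfc(mu_a, sd_a, mu_b, sd_b, decode, n_samples=2000, delta=0.25, seed=0):
    """Per-gene mean log2 fold change and P(|LFC| > delta) under the two posteriors."""
    rng = np.random.default_rng(seed)
    # Draw latent samples from each sample's (diagonal Gaussian) posterior.
    z_a = mu_a + sd_a * rng.normal(size=(n_samples, len(mu_a)))
    z_b = mu_b + sd_b * rng.normal(size=(n_samples, len(mu_b)))
    # Decode to expression and compare on the log2 scale, gene by gene.
    lfc = np.log2(decode(z_a)) - np.log2(decode(z_b))
    return lfc.mean(axis=0), (np.abs(lfc) > delta).mean(axis=0)

# Toy decoder: 2 latent dimensions -> 3 genes.
# Gene 0 follows latent 0, gene 1 is constant, gene 2 follows latent 1.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
decode = lambda z: np.exp(z @ W)

mean_lfc, p_de = posterior_lfc(
    mu_a=np.array([2.0, 0.0]), sd_a=np.array([0.1, 0.1]),
    mu_b=np.array([0.0, 0.0]), sd_b=np.array([0.1, 0.1]),
    decode=decode,
)
```

A volcano-style plot would then put `mean_lfc` on the x axis and `p_de` (or a transform of it) on the y axis. For a population-level comparison, the same idea applies with latent samples pooled across the samples of each group.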

45 of 69

Celligner can explain classification decisions using LRP

We can then apply GSEA to see if we find meaningful gene programs

E2F, MYC, IGSF21: all known targets / biomarkers of lung cancer.

The classifier focuses on lung / lung cancer gene programs to predict whether or not a DepMap sample is lung.

GSEA on relevant features predicting a DepMap lung cell line (with LRP)

46 of 69

GSEA: explaining classification decisions

In a model trained with GTEx data.

Explaining lung lineage prediction for CCLE cell lines only. We see a focus on other pathways that are more cell-line specific.

47 of 69

Adding more data helps make the model bigger

UMAP of CCLE/TCGA/GTEx without classification for CCLE. 42 latent dimensions, 3000 key genes, larger model

48 of 69

Current work

MMD is not efficient when adding multiple bias classes

  • Combinatorial explosion when computing MMD
  • Works with continuous features as well (like removing the purity component based on percent purity)
  • Still WIP; not yet sure whether this allows any improvement.
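A small sketch of why pairwise MMD scales badly, under the assumption that one MMD term is computed for every pair of covariate combinations (one reading of the "combinatorial explosion" above). The covariate sizes used in the example are hypothetical.

```python
from math import comb, prod

def n_mmd_terms(levels_per_covariate):
    """Number of distinct conditions and of pairwise MMD terms between them."""
    n_conditions = prod(levels_per_covariate)  # every combination of covariate levels
    return n_conditions, comb(n_conditions, 2)

# Hypothetical covariates: 2 datasets, 30 lineages, 3 sequencing protocols.
one_covariate = n_mmd_terms([2])           # (2, 1)
two_covariates = n_mmd_terms([2, 30])      # (60, 1770)
three_covariates = n_mmd_terms([2, 30, 3]) # (180, 16110)
```

The number of pairwise terms grows quadratically in the number of conditions, which itself grows multiplicatively with each added bias class.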

49 of 69

Current work and some next steps

Use GCNNs to scale our analysis to genome wide

First layers convolve on genes known to interact (from PPI and other sources)

  1. Allows deeper models with skip connections.
  2. Can predict new relationships.
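A minimal sketch of one graph-convolution layer over a gene interaction graph, using the common symmetric normalization with self-loops. The function names, the optional skip connection, and the three-gene toy graph are hypothetical; this is not the Celligner2 architecture.

```python
import numpy as np

def normalized_adjacency(adj):
    """D^-1/2 (A + I) D^-1/2 for a symmetric gene-gene adjacency matrix."""
    a_hat = adj + np.eye(len(adj))          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(h, a_norm, w, skip=False):
    """ReLU(A_norm @ H @ W), optionally with a skip connection when shapes match."""
    out = np.maximum(a_norm @ h @ w, 0.0)
    if skip and out.shape == h.shape:
        return out + h
    return out

# Toy graph: genes 0 and 1 interact (e.g. from a PPI source), gene 2 is isolated.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
a_norm = normalized_adjacency(adj)

h = np.array([[2.0], [4.0], [6.0]])  # one expression feature per gene, one sample
w = np.array([[1.0]])                # identity weight, for readability

out_plain = gcn_layer(h, a_norm, w)
out_skip = gcn_layer(h, a_norm, w, skip=True)
```

Each interacting gene's output averages over its neighborhood, while the isolated gene keeps its own value; the skip connection adds the input back, which is what makes stacking deeper layers practical.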

Other Q: Can we then apply Celligner2 to dependency data?

50 of 69

Recap of the new version

New features

  • able to work with many datasets
  • performs better correction when large batch effects exist (e.g. between cancer cell lines and frozen tumor tissues)
  • counterfactual predictions of gene count outputs: “show this cell line as if it were a tumor”
  • explainability using explainable-AI tools like LRP with GSEA / differential expression
  • QC methods: getting at quality (using scIB); various interactive UMAP plots
  • semi-supervision to classify cell type and any other feature provided
  • added scArches’ model surgery
  • improved model size by using more input genes (3000 instead of <1000 previously)
  • reproduces all results from Celligner1 (and is faster to train)

New usages

  • a model that can be tuned
  • adding your own data: tumors, PDX, cell lines, 3D, …
  • figuring out the expected expression of genes if the model were a tumor
  • also works for scRNAseq
  • easily extendable to new modalities (NNs’ compositionality)

51 of 69

Celligner2 can work with new modalities

Simple tweak to add the “multigrate” framework.

OK to work with missing features.

Will likely need a lot of training data and would mostly work with paired single cell type data.

With an unbiased model, any Matrix is a new feature

52 of 69

Questions

53 of 69

Finding targets

54 of 69

Why is this hard?

Cell line features

Machine learning model

Model interpretation

Model accuracy

Why is this a hard problem?

  • Many more features than samples, few sensitive lines
  • Potentially complex (i.e. nonlinear, multi-factor) relationships
  • Features are highly intercorrelated

55 of 69

Omics

56 of 69

CCLE: history, production and driving impact.

CCLE2: Novartis + Broad Institute.

Issues: No one left by mid-2019.

Productionalization: DepMapOmics with DMC and quarterly releases.

Impact: Open source state of the art for cancer cell line omics analysis.

RNA, WES, WGS, proteo, methylation, …

57 of 69

CCLE: history, production and driving impact.

[Pipeline diagram, run on Terra]
Inputs: WES/WGS bams, RNAseq bams; GTEx
Tools: GATK4 CNV, CGA mutation calling pipelines, STAR-Fusion, STAR, RSEM; each branch followed by a postprocess step
Outputs:
Copy number: CN segments, gene-level CN
Mutations (Omics mutation pipeline, annotations)
mRNA expression: gene level, transcript level
Transcript fusions: filtered fusions, unfiltered fusions
Scale labels: 7000 bams, ~500 new lines, ~400 new lines

58 of 69

CCLE: history, production and driving impact.

59 of 69

Model for finding mechanisms

WIP, Jérémie Kalfon

Task: Improve our biomarker prediction.

Issue: RNA is great, mutations… not so much.

Idea: Change our framework of prediction

60 of 69

Linking Expression to Methylation status

Unpublished, WIP, David Wu, Beroukhim Lab

corr (predicted, actual expression)

using XGBoost

corr (expression vs coverage-weighted beta value)

61 of 69

SpliceAI: predicting splicing QTLs in silico

Large NN: ResNet with dilated convolutional layers

Up to 10k bp context

Trained on the GTEx dataset

Used as is in our pipeline, sending it the reference genome with SNPs

→ Prediction output used to run t-test over exon inclusion of samples [with / without] mutations
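A minimal sketch of the t-test step, assuming per-sample exon inclusion values (e.g. PSI between 0 and 1) split by whether the sample carries the predicted splice-altering mutation. Welch's unequal-variance statistic is used here as an assumption; the slide does not say which t-test variant the pipeline runs, and the values below are made up for illustration.

```python
import numpy as np

def welch_t(with_mut, without_mut):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    a = np.asarray(with_mut, dtype=float)
    b = np.asarray(without_mut, dtype=float)
    # Squared standard error of each group mean.
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical exon inclusion values for one exon.
psi_with_mutation = [0.80, 0.90, 0.85, 0.95]
psi_without_mutation = [0.20, 0.30, 0.25, 0.35, 0.30]

t_stat, dof = welch_t(psi_with_mutation, psi_without_mutation)
```

A p-value would then come from a t distribution with `dof` degrees of freedom; `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test directly.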

62 of 69

Successes

63 of 69

Splicing & non-coding features interpretation

WIP, Jérémie Kalfon, David Wu

Fresh new results being investigated.

Intronic mutations predicted to splice in exon 4 of geneX make the cell line more dependent on geneY.

We can do a lot better.

Dependency present in leukemias.
More relationships to come.

Intronic mutation in X

Gene Y

64 of 69

WRN & MSI

Microsatellite instability (MSI): instability in short (2-5 nucleotide) repeats. Results from an impaired DNA mismatch repair pathway.

  1. TA-repeats form DNA secondary structures.
  2. Stall replication forks.
  3. Require unwinding by the WRN helicase.

MSI → WRN synth. dep.

65 of 69

De-risking targets

66 of 69

Going beyond: prioritizing targets

Approaches to further validate targets

In DepMap

  • Druggable CansarDB
  • in vivo screens (PDXs)
  • PRISM

Computational approaches:

  • AlphaFold2: predict protein 3D structure.
  • MCTS: planning chemical synthesis.
  • GFlowNets: for chemical design.
  • Drug toxicity predictions; planning experiments for large combinatorial problems.

“It will remain artisanal for a while… We need to create tools for artisans”

67 of 69

Added complexity: drug polypharmacology

Drugs rarely produce similar effects to a gene knockout

→ Use Combination KOs

68 of 69

Interpretability through clusters

Recovering large biology events

  • Inject prior knowledge
  • Find patterns among associated features
  • Build mechanistic hypotheses

20q CN

18q CN

VPS4B CN

VPS4B GE

CHMP4B CN

CHMP4B GE

Features

Cell Lines

69 of 69

Interpretability through clusters

Recovering large biology events

VPS4A ~ ‘ESCRTIII complex regulation by VPS4’ + ‘HAUS complex’