1 of 31

STARTLE: A STAR HOMOPLASY APPROACH FOR CRISPR-CAS9 LINEAGE TRACING

Palash Sashittal, Henri Schmidt, Michelle Chan, and Benjamin J Raphael

2 of 31

Motivation

  • CRISPR-Cas9 – allows researchers to induce heritable mutations at specific target sites in the genome to enable high resolution lineage tracing.
  • Standard phylogenetic algorithms – Neighbor Joining (NJ), UPGMA, and hierarchical clustering have been used to infer phylogenetic trees from lineage tracing data.
  • These methods do not model the features that distinguish CRISPR-Cas9 lineage tracing data from molecular sequence data.
  • CRISPR-Cas9 lineage tracing data has high rates of homoplasy and high rates of missing data (20%-40%).
  • Lineage tracing data has a large number of derived states for a small number of sites.
  • Specialized lineage tracing methods – Cassiopeia, LinTIMaT, GAPML – infer lineage trees from lineage tracing data, but use heuristics to handle missing data, leading to poor performance at typical dropout rates.
  • These methods also do not model several features that are characteristic of CRISPR-Cas9-induced mutations.

3 of 31

CRISPR-Cas9

  • The Cas9 protein is directed by guide RNAs to create double-strand breaks at specific sequences
    • Target Sites
  • Breaks are repaired by error-prone DNA repair mechanisms – can result in heritable insertions/deletions
  • Mutations render target site distinct from guide RNA – preventing further action of Cas9.
  • Therefore, CRISPR-Cas9-induced mutations are non-modifiable.
  • “A target site may mutate multiple times during the course of evolution but is constrained to mutate at most once along any lineage.”
  • Lineage tracing: other methods have not explored the combinatorial properties of non-modifiability under the maximum parsimony framework.

Image credit: Javier Zarracina/Vox

4 of 31

Star homoplasy model

  • Target site (character) can mutate only once along a lineage.
  • Same mutation can occur independently in multiple cells – homoplasy
    • Number of homoplasies is typically unbounded
    • Special case – k-star homoplasy, where a mutation can occur at most k times across all lineages
  • 0 – ancestral unmutated state; 1,2, …, r – mutated states. For each lineage there is at most one transition from 0 to {1, 2, …, r}
  • State transition graphs – describe the allowed transitions between the states of a character.
    • Vertices – States of a character; Directed Edges – Allowed Transitions; Edge Multiplicity - # Homoplasies.
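
The constraints above can be checked mechanically on a tree whose vertices are already labeled. The following is a minimal sketch, not Startle's implementation: the `parent`/`state` dictionaries and the function name `mutation_counts` are hypothetical, and the check covers a single character with no missing data.

```python
from collections import Counter

def mutation_counts(parent, state):
    """Count how often each mutated state arises on a vertex-labeled tree.

    parent: dict mapping each non-root vertex to its parent (assumed encoding).
    state:  dict mapping every vertex to its state for ONE character (0 = unmutated).
    Raises ValueError if any edge changes an already-mutated state,
    i.e. if the labeling violates non-modifiability.
    """
    counts = Counter()
    for child, par in parent.items():
        if state[par] == state[child]:
            continue  # no transition on this edge
        if state[par] != 0:
            raise ValueError(f"edge {par}->{child} modifies a mutated state")
        counts[state[child]] += 1  # one 0 -> s transition
    return counts

# Toy tree: state 1 arises twice (homoplasy), state 2 arises once.
parent = {"a": "root", "b": "root", "c1": "a", "c2": "a", "c3": "b", "c4": "b"}
state = {"root": 0, "a": 1, "c1": 1, "c2": 1, "b": 0, "c3": 1, "c4": 2}
counts = mutation_counts(parent, state)
k = max(counts.values())  # smallest k for which this labeling is k-star homoplasy
```

In this toy example the labeling satisfies 2-star homoplasy but not 1-star homoplasy, because state 1 is acquired independently on two edges.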

[Figure labels: single-cell RNA sequencing; Startle]

5 of 31

State transition multigraphs

  • Two-state perfect phylogeny – a character can change state at most once in the phylogeny
  • Two-state 2-Camin Sokal – a character can change state at most twice in the phylogeny
  • Camin Sokal – imposes an ordering on mutations. Only allows transitions that respect this ordering. Enforces irreversibility of mutations.
    • Multigraph of k-Camin Sokal is a tree, with all internal vertices (except root) having one incoming edge and one outgoing edge with multiplicity k.
  • Star homoplasy – most similar to Camin Sokal. Enforces non-modifiability.
    • Multigraph of k-Star homoplasy is a star tree where each edge has multiplicity k.
  • Two state characters – Star Homoplasy and Camin Sokal are identical.
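
To make the two multigraphs concrete, here is a small sketch that encodes each one as a mapping from directed edges to multiplicities. The function names are hypothetical, and the Camin Sokal ordering is assumed to be the linear order 0 → 1 → … → r for illustration.

```python
from collections import Counter

def star_homoplasy_multigraph(r, k):
    """Star tree: every mutated state s is reached directly from state 0, k times."""
    return Counter({(0, s): k for s in range(1, r + 1)})

def camin_sokal_multigraph(r, k):
    """Path 0 -> 1 -> ... -> r (assumed linear ordering), each edge with multiplicity k."""
    return Counter({(s - 1, s): k for s in range(1, r + 1)})

def is_allowed(graph, src, dst):
    """A transition is allowed if it is an edge of the multigraph (or no change)."""
    return src == dst or graph[(src, dst)] > 0

star = star_homoplasy_multigraph(r=3, k=2)
path = camin_sokal_multigraph(r=3, k=2)
```

With `r=1` both constructors return the single edge `(0, 1)` with multiplicity k, which matches the statement that the two models coincide for two-state characters. For `r=3`, the transition 1 → 2 is allowed in `path` but not in `star`, reflecting non-modifiability.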

6 of 31

Reconstructing Single Cell Phylogenies

  •  

7 of 31

Star Homoplasy Phylogenies

  •  

8 of 31

Mutation Weights

  •  

9 of 31

Large parsimony problem – SH Model

  •  

10 of 31

Small parsimony problem – SH Model

  •  

11 of 31

Small parsimony problem – SH Model

  •  

12 of 31

Small parsimony problem – SH Model

  •  

 

13 of 31

Small parsimony problem – SH Model

  •  

 

14 of 31

Large parsimony problem – k-SH Model

  •  

15 of 31

Combinatorial characterization of k-star homoplasy matrices

  •  

16 of 31

Combinatorial characterization of k-star homoplasy matrices

  •  

17 of 31

PROOF: binarization to k-star homoplasy phylogeny

  •  

18 of 31

PROOF: binarization to k-star homoplasy phylogeny

  •  

19 of 31

PROOF: k-star homoplasy to binarization

  •  

20 of 31

Startle-ILP

  •  

21 of 31

Startle-ILP

  •  

22 of 31

Startle-ILP

  •  

23 of 31

Startle-ILP

  •  

24 of 31

Startle-NNI

  • Second approach to solve Large-SH Parsimony – searches through tree space using subtree exchange operations.
  • Start with a small set of candidate binary trees – locally improve them using subtree interchange operations.
  • Store a set of candidate trees – select one uniformly at random. Stochastically perturb the selected tree using random nearest neighbor interchange (NNI) operations.
  • Apply hill-climbing to minimize W(T) until we reach a local minimum.
  • Once a local minimum is reached check if parsimony score improves on any candidate trees. Update candidate set.
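
The search loop described above can be sketched as follows. This is an illustration under stated assumptions, not Startle's code: trees are rooted binary nested tuples, `sh_score` is a unit-weight, no-missing-data stand-in for W(T), and replacing the worst candidate is one assumed way to "update the candidate set". All function names are hypothetical.

```python
import random

def sh_score(tree, states):
    """Assumed stand-in for W(T): number of 0 -> s transitions when each vertex
    takes state s only if all leaves below it carry s (unit weights)."""
    total = 0
    for c in range(len(next(iter(states.values())))):
        def visit(node):
            if not isinstance(node, tuple):
                return states[node][c], 0
            (sl, cl), (sr, cr) = visit(node[0]), visit(node[1])
            if sl == sr:
                return sl, cl + cr
            return 0, cl + cr + (sl != 0) + (sr != 0)
        root_state, cost = visit(tree)
        total += cost + (root_state != 0)
    return total

def nni_neighbors(tree):
    """All trees one rooted NNI move away: swap a subtree with its 'aunt'."""
    if not isinstance(tree, tuple):
        return []
    left, right = tree
    out = []
    if isinstance(left, tuple):
        a, b = left
        out += [((right, b), a), ((a, right), b)]
    if isinstance(right, tuple):
        a, b = right
        out += [(a, (left, b)), (b, (a, left))]
    out += [(t, right) for t in nni_neighbors(left)]
    out += [(left, t) for t in nni_neighbors(right)]
    return out

def hill_climb(tree, states):
    """Move to the best NNI neighbor until no neighbor strictly improves the score."""
    score = sh_score(tree, states)
    while True:
        scored = [(sh_score(t, states), t) for t in nni_neighbors(tree)]
        if not scored:
            return tree, score
        best_score, best = min(scored, key=lambda p: p[0])
        if best_score >= score:
            return tree, score
        tree, score = best, best_score

def nni_search(candidates, states, iterations=20, perturb=2, seed=0):
    rng = random.Random(seed)
    pool = [hill_climb(t, states) for t in candidates]
    for _ in range(iterations):
        tree, _old = rng.choice(pool)            # pick a candidate uniformly at random
        for _step in range(perturb):             # stochastic perturbation by random NNI
            nbrs = nni_neighbors(tree)
            if nbrs:
                tree = rng.choice(nbrs)
        tree, score = hill_climb(tree, states)   # descend to a local minimum
        worst = max(range(len(pool)), key=lambda i: pool[i][1])
        if score < pool[worst][1]:               # assumed update rule
            pool[worst] = (tree, score)
    return min(pool, key=lambda p: p[1])

states = {"A": (1, 0), "B": (1, 0), "C": (0, 2), "D": (0, 2)}
start = (("A", "C"), ("B", "D"))
best_tree, best_score = nni_search([start], states)
```

In this toy instance the starting tree is already a local minimum for plain hill climbing, which illustrates why the perturbation step is useful.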

25 of 31

Startle-NNI

  •  

26 of 31

Test Suite

  •  

27 of 31

Test Suite

  •  

28 of 31

Synthetic Results

  •  

29 of 31

Synthetic Results

  •  

30 of 31

Results on Published Data

  • Applied Startle to analyze two datasets from a recently published CRISPR-based lineage tracing experiment in mouse models of metastatic lung adenocarcinoma.
  • Startle infers more parsimonious phylogenies than the published phylogenies. The following are the scores for the first and second datasets, respectively:
    • Startle score = 315.51, 4715.5
    • Published score = 317.55, 4827.43
  • No ground-truth data – differences between the Startle and published phylogenies are evaluated by examining the consistency of the inferred phylogenies.
    • Computed normalized RF distance between lineage trees for Primary Tumor and Primary Tumor + Metastases
    • Startle normalized RF = 0.15, 0.525, Published normalized RF = 0.22, 0.6
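
For reference, a normalized Robinson-Foulds (RF) distance between two rooted trees on the same leaf set can be sketched as below. The nested-tuple tree encoding and the normalization by the total number of nontrivial clades are assumptions; the slides do not state which normalization was used or how the two leaf sets were matched.

```python
def clades(tree):
    """Leaf sets of all internal vertices except the root (nontrivial clades)."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    found.discard(visit(tree))  # the root clade is shared by every tree
    return found

def normalized_rf(t1, t2):
    """Size of the symmetric difference of clade sets, divided by their total size."""
    c1, c2 = clades(t1), clades(t2)
    total = len(c1) + len(c2)
    return len(c1 ^ c2) / total if total else 0.0

t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ((("A", "B"), "C"), "D")
```

A value of 0 means the two trees contain exactly the same clades, and 1 means they share none.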

31 of 31

Results on Published Data

  • Startle-inferred phylogenies yield a more parsimonious description of the metastatic process.
  • Given a labeling of the anatomical site for each leaf in phylogeny T, find a label for the anatomical site of each ancestral vertex that minimizes the number of migrations.
  • Threshold: pairs of sites that have more than 10 migrations in both phylogenies (Startle and Published).
  • The following are the suggested migrations:
    • Second Dataset: Startle – 142, Published – 194
    • Primary to Lymph Node: Startle – 19, Published – 22
    • Primary to Kidney: Startle – 2, Published – 2
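
The ancestral site labeling problem stated above is an instance of small parsimony on a single character. One standard way to compute the minimum number of label changes on a rooted binary tree is the bottom-up pass of Fitch's algorithm, sketched below. Whether the analysis on these slides used this exact procedure is an assumption; the function name and nested-tuple tree encoding are hypothetical, and no constraint is placed on the root label.

```python
def min_migrations(tree, site):
    """Minimum number of edges whose endpoints must carry different site labels.

    tree: rooted binary tree as nested tuples of leaf names (assumed encoding).
    site: dict mapping each leaf name to its anatomical site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if not isinstance(node, tuple):
            return {site[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common      # children can agree on a label: no migration needed
        changes += 1           # disjoint label sets: one migration is unavoidable
        return left | right

    visit(tree)
    return changes

tree = ((("a", "b"), "c"), "d")
one = min_migrations(tree, {"a": "primary", "b": "primary", "c": "lymph", "d": "lymph"})
two = min_migrations(tree, {"a": "primary", "b": "lymph", "c": "primary", "d": "kidney"})
```

Counting migrations per ordered pair of sites, as in the per-site numbers above, would additionally require the top-down pass that fixes a label at each ancestral vertex.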