1 of 42

Inference of single-cell phylogenies from lineage tracing data using Cassiopeia

Jones et al., Genome Biology 2020

Presented by: Kowshika Sarker

2 of 42

Contributions

  • Cassiopeia – a suite of phylogeny reconstruction algorithms from lineage tracing data

  • A simulation framework for benchmarking such algorithms

  • A new experimental lineage tracing dataset

  • Insights for improved lineage tracer technology design

3 of 42

CRISPR-Cas9

4 of 42

Existing Works

  • Algorithms like neighbor joining or Camin-Sokal
    • Cannot scale to tens of thousands of cells

    • Cannot handle a lot of missing data
      • Heritable missing data –  Cas9-induced deletions
      • Stochastic missing data – sequencing error

    • Not robust to varying parameters
      • Mutation rate, target site count etc.

5 of 42

Cassiopeia

  • Cassiopeia-ILP (<200 cells)
    • The most parsimonious phylogeny using a Steiner tree approach
    • Highly optimal

  • Cassiopeia-Greedy (>10,000 cells)
    • Phylogenies from mutations likely occurring earliest in the experiment
    • Faster runtime

  • Cassiopeia-Hybrid (>200 & < 10,000 cells)
    • Scalable like Cassiopeia-Greedy and exact like the Cassiopeia-ILP

6 of 42

Cassiopeia-ILP

Steiner tree (ST) problem

G = (V , E): STVs → A minimum cost tree that spans Vs

Vs = V: STVs = MST → Polynomial algorithm

In general, STVs is NP-hard

7 of 42

Cassiopeia-ILP

  • Pick pair of source nodes Va and Vb
  • Find their LCA: Vab
  • For ith character,
    • if Va (i) = Vb(i), then Vab(i) = Va (i) = Vb(i),
    • Else, Vab(i) is unmutated
  • Edit distance of Va and Vb < threshold
    • Directed edges Vab(i) → Va (i) and Vab(i) → Vb (i)
    • Edge weight: Edit distance of parent and source
  • Repeat till only root is left as source node

8 of 42

Cassiopeia-ILP

9 of 42

Cassiopeia-ILP

  • ILP: a type of optimization problem where the variables are integer values and the objective function and equations are linear.

10 of 42

Cassiopeia-Greedy

11 of 42

Cassiopeia-Greedy

  • Character (i), mutation (j) pair to split
    • pi - Prior probability
    • ni,j – Cell count with j at i
    • fi,j – Some function of ni,j
      • Best fi,j = ni,j

Without

prior

With

prior

12 of 42

Cassiopeia-Greedy

Perfect phylogeny

  • Matric M: n cells, n characters, k states
  • There is a zero root perfect phylogeny if a tree T exists-
    • Root is all zeroes
    • Every character state is mutated into at most once
    • All non-leaf nodes have at least 2 children

13 of 42

Cassiopeia-Greedy

Perfect phylogeny

14 of 42

Cassiopeia-Greedy

15 of 42

Cassiopeia-Greedy

16 of 42

Cassiopeia-Greedy

17 of 42

Cassiopeia-Hybrid

Split with cutoff for each subset (e.g., 200 cells)

18 of 42

Evaluation: Accuracy

19 of 42

Evaluation: Optimality

20 of 42

Evaluation: Scalability

21 of 42

Evaluation: Scalability

22 of 42

Evaluation: Robustness

Bootstrapping analysis

Sampling with replacement

Character – M sampled from M

Tree – 100 sampled from 10

Metric: Transfer Bootstrap Expectation (TBE)

minimum number of taxa to be removed to make the two bipartitions identical

23 of 42

Evaluation: Robustness

24 of 42

Evaluation: Parallel Evolution

25 of 42

Evaluation: Parallel Evolution

Cassiopeia-Greedy

26 of 42

Evaluation: Parallel Evolution

  • Mutations observed at high frequency at the leaves likely occur earlier in the phylogeny

27 of 42

Simulation Framework

  • Variable hyperparameters
    • no. of characters (Cas9 target sites)
    • number of states (possible Cas9-induced indels)
    • probability distribution over these states
    • mutation rate per character
    • number of cell generations
    • amount of missing data

28 of 42

Simulation Framework

  • Can fit real data by estimating parameters
    • Estimates default values of variable hyperparameters
    • Per character mutation rate
    • Prior probabilities of indels

    • K = average #mutation across cells
    • n = No. of target sites
    • d = No. of generations

    • f(s) = No. of intBC that has the indel s in at least one cell
    • I = No. of total intBC in the dataset

29 of 42

Lineage Tracer Design Insights

Information Capacity (IC)

    • No. of characters
    • No. of possible states

  • Cassiopeia performance improves with high IC
  • NJ and CS performance degrade

30 of 42

Lineage Tracer Design Insights

Indel Distribution

θ: 0 < θ < 1



Greater θ → better accuracy

Especially for low #states or large #samples

31 of 42

Lineage Tracer Design Insights

  • Increases in information capacity (more target sites or possible indels)

  • Tuned mutation rate for low parallel evolution

32 of 42

Experimental Dataset

  • Existing experimental datasets lack a ground truth phylogeny

34,557 cells

33 of 42

Experimental Dataset

  • Intra-plate cells are expected to be closer than inter-plate cells
  • Some inter-plate cells may be closer than own plate-mates due to indels introduces before the splits

34 of 42

Evaluation: Experimental dataset

35 of 42

Evaluation: Experimental dataset

36 of 42

Evaluation: Experimental dataset

Categorical variable – Plate ID in experiment

37 of 42

Evaluation: Experimental dataset

38 of 42

Evaluation: Experimental dataset

39 of 42

Evaluation: Generalizability

GESTALT linage tracing data

40 of 42

Evaluation: Generalizability

Emerging base-editor tracers

    • Precisely edit A > G, C > T, or possibly C > N (N being any base)
    • Increased target sites, few possible states (4 at best)
    • Less dropout/missing data
    • Less cytotoxic due to independence from double-stranded breaks of Cas9

High-character low-state regimes

41 of 42

Evaluation: Generalizability

Varying mutation rate across target sites

42 of 42

Future Directions

  • Exploring maximum likelihood or Bayesian inference with prior indel probabilities
  • Resolving branch length of maximum parsimony solutions
    • Possibly with paired transcriptomic data
  • Explicitly differentiating stochastic and heritable missing data
    • Possibly with supertree methods
  • Designing alternative greedy heuristics