Efficient storage and analysis of ARGs using the tree sequence format

Node annotation versus edge annotation

Griffiths (1991) described an ARG format in which each “recombination node” is annotated with a single breakpoint (fig A). However, this cannot easily represent inheritance processes such as gene conversion or recombination with multiple breakpoints. The equivalent tskit representation (fig B) annotates edges with intervals. Intervals represent the regions of genome directly transmitted between parent and child, and can easily accommodate arbitrary patterns of inheritance. Moreover, they can be modified to trace only those regions that are inherited by the sampled genomes (a, b, and c): we call this a sample-resolved ARG.
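The difference between the two annotation schemes can be sketched in a few lines of plain Python. This is an illustrative toy, not tskit's API: the `Edge` tuple, the helper names, and the node IDs are all assumptions made for the example. The only convention borrowed from tskit is the half-open interval [left, right). A single-breakpoint recombination node translates into two edges, whereas a gene-conversion-like pattern needs three edges and has no single-breakpoint equivalent.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float   # inclusive start of the transmitted interval
    right: float  # exclusive end of the transmitted interval
    parent: int
    child: int


def edges_from_breakpoint(child, left_parent, right_parent, breakpoint, seq_len):
    """Translate one single-breakpoint recombination node into two edges."""
    return [
        Edge(0, breakpoint, left_parent, child),
        Edge(breakpoint, seq_len, right_parent, child),
    ]


def covers_genome(edges, seq_len):
    """Check that the edges above one child tile [0, seq_len) exactly."""
    position = 0
    for e in sorted(edges):  # tuples sort by `left` first
        if e.left != position:
            return False
        position = e.right
    return position == seq_len


SEQ_LEN = 7  # hypothetical genome length

# Node annotation: child 3 has a breakpoint at position 2 (parents 4 and 5).
single = edges_from_breakpoint(
    child=3, left_parent=4, right_parent=5, breakpoint=2, seq_len=SEQ_LEN
)

# Edge annotation also handles a gene-conversion-like tract: child 3 inherits
# [2, 4) from parent 5 and everything else from parent 4.
gene_conversion = [Edge(0, 2, 4, 3), Edge(2, 4, 5, 3), Edge(4, 7, 4, 3)]
```

Both edge lists pass `covers_genome`, but only the first could have been written as a node carrying one breakpoint.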

Sample resolution and efficiency

Sample-resolved edge annotations (fig C) allow efficient extraction of local trees. In particular, it is easy to figure out which edges change from one tree to the next without traversing the entire graph (Kelleher et al., 2016). This is vital for large ARGs: in our ARG of one million genomes from the UK Biobank, it takes only 11 seconds to iterate over 16 thousand trees on chromosome 20 (Kelleher et al., 2019). This is key to tskit’s efficient analysis methods.
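To illustrate why interval-annotated edges make tree-to-tree changes cheap to find, here is a minimal pure-Python sketch in the spirit of Kelleher et al. (2016). It is not tskit's implementation (tskit exposes comparable functionality through its own tree iteration methods); the `Edge` tuple, function names, and the toy three-sample example are assumptions made for illustration. Edges are sorted once by left and once by right coordinate, and each local tree is then obtained from the previous one by removing and inserting only the edges that end or start at the boundary.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float
    right: float
    parent: int
    child: int


def edge_diffs(edges, seq_len):
    """Yield ((left, right), edges_out, edges_in) for each local tree."""
    by_left = sorted(edges, key=lambda e: e.left)    # insertion order
    by_right = sorted(edges, key=lambda e: e.right)  # removal order
    i = j = 0
    left = 0
    while left < seq_len:
        edges_out = []
        while j < len(by_right) and by_right[j].right == left:
            edges_out.append(by_right[j])
            j += 1
        edges_in = []
        while i < len(by_left) and by_left[i].left == left:
            edges_in.append(by_left[i])
            i += 1
        # The current tree ends at the next coordinate where any edge changes.
        right = seq_len
        if i < len(by_left):
            right = min(right, by_left[i].left)
        if j < len(by_right):
            right = min(right, by_right[j].right)
        yield (left, right), edges_out, edges_in
        left = right


def local_trees(edges, seq_len):
    """Yield (interval, child -> parent map), updating one map incrementally."""
    parent = {}
    for interval, edges_out, edges_in in edge_diffs(edges, seq_len):
        for e in edges_out:
            del parent[e.child]
        for e in edges_in:
            parent[e.child] = e.parent
        yield interval, dict(parent)


# Toy example: samples 0, 1, 2; ancestors 3 and 4; one breakpoint at 5.
edges = [
    Edge(0, 10, 3, 1), Edge(0, 5, 3, 0), Edge(5, 10, 4, 0),
    Edge(0, 5, 4, 2), Edge(5, 10, 3, 2), Edge(0, 10, 4, 3),
]
trees = list(local_trees(edges, seq_len=10))
diffs = list(edge_diffs(edges, seq_len=10))
```

In this example only two edges leave and two enter at position 5; the two edges spanning the whole genome are never revisited, which is the property that keeps iteration fast on very large ARGs.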

Converting a direct inheritance ARG (fig B) to the sample-resolved version (fig C) is equivalent to running Hudson's (1983) algorithm on a fixed ARG topology, e.g. by using tskit’s ‘simplify’ algorithm, which is widely used in forward simulations.
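The core idea of sample resolution, trimming each edge to the part of the genome that actually reaches a sample, can be sketched as follows. This is a deliberately reduced illustration and not tskit's simplify algorithm: it keeps every node, and unlike Hudson's algorithm it keeps tracing material above the point where all samples have coalesced. The `Edge` tuple, the `node_time` mapping, and the toy ARG are hypothetical names introduced for the example.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float
    right: float
    parent: int
    child: int


def merge(intervals):
    """Return the union of half-open intervals, sorted and non-overlapping."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


def sample_resolve(edges, node_time, samples, seq_len):
    """Trim each edge to the material its child passes on to some sample."""
    ancestral = {s: [(0, seq_len)] for s in samples}
    resolved = []
    # Process edges from the youngest child upwards, so that a node's
    # ancestral material is complete before the edges above it are trimmed.
    for e in sorted(edges, key=lambda e: node_time[e.child]):
        for left, right in merge(ancestral.get(e.child, [])):
            lo, hi = max(left, e.left), min(right, e.right)
            if lo < hi:
                resolved.append(Edge(lo, hi, e.parent, e.child))
                ancestral.setdefault(e.parent, []).append((lo, hi))
    return resolved


# Toy direct-inheritance ARG: sample 0 is a recombinant of nodes 2 and 3.
node_time = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
direct = [
    Edge(0, 4, 2, 0), Edge(4, 10, 3, 0), Edge(0, 10, 3, 1),
    Edge(0, 10, 4, 2), Edge(0, 10, 4, 3),
]
resolved = sample_resolve(direct, node_time, samples=[0, 1], seq_len=10)
```

Here node 2 only passes [0, 4) on to a sample, so the edge above it shrinks from [0, 10) to [0, 4), while the edge above node 3 keeps its full span.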

Event ARGs versus Genome ARGs

The three node types in fig A correspond to different events (recombination, common ancestor, and sampling). In a tskit ARG, nodes represent genomes. Edges connected to a single node can then represent multiple events. In fig D, the recombination nodes involving breakpoints at positions 4 and 6 have been "simplified" away, with breakpoint information retained on the edges above node b. The genealogy is smaller and contains only “knowable” nodes (i.e. those directly inferable from the genetic sequence). This is important for ARG inference, as complete precision (e.g. times of recombination or participating lineages) is not required. It also applies to coalescence: nodes can have more than two children (not shown).

Definitions

ARG: A graph structure representing genetic ancestry of a set of sampled genomes with recombination

Tskit: The "tree sequence toolkit", open-source software that can store different sorts of ARG in the succinct tree sequence format

Software that uses tskit to store ARGs

Native: msprime, SLiM, tsinfer, slendr, fwdpy11, Espalier

Import/Export: Relate, ARGinfer, ARG-Needle

Pros

  • Simple notation
  • Easy likelihood calculation

Cons

  • Tied to events in generative model (e.g. no gene conversion)
  • Inefficient to extract local trees
  • L+R edges over a RE node need distinguishing e.g. by edge order ∴ hard to simulate

Fig A.

"Event" ARG or eARG

eARG = problematic format

Direct inheritance

Pros

  • Fast tree extraction
  • CwR likelihood still calculable
  • Edge spans reflect transmission to samples (shown e.g. in line width)
  • Can be simplified to remove nodes involved in recombination & common ancestors with no genetic coalescence

Cons

  • Loses exact location of trapped breakpoints (e.g. 3)

Tskit format: gARG (nodes represent genomes), edges annotated

Sample-resolved inheritance

As per Minichiello & Durbin (2006), we distinguish this from the neutral generating process of the CwR (coalescent with recombination).

Tskit traces genetic transmission by edge annotation, allowing flexible representation of different sorts of ARG, e.g.

  • ARGs with direct inheritance (fig B)

Equivalent to a "standard" node-annotated ARG (fig A) as e.g. proposed by Griffiths (1991)

  • ARGs with sample-resolved inheritance
    • "Full" ARGs (fig C)

Proposed by Hudson (1983) and e.g. simulated using msprime's record_full_arg option

    • "Coalescent" ARGs (fig D)

Containing only nodes in which coalescence occurs (Kelleher et al., 2016), e.g. the default msprime output and the ARGs inferred by tsinfer

    • "Tree-by-tree" ARGs (not shown)

Also known as an “interval-tree ARG” (Kuhner & Yamato, 2017): only the sample nodes are fully shared between trees, which results in independent or semi-independent local trees, e.g. as inferred by Relate

References

Griffiths (1991) Proc Sheffield Symp Appl Prob 18: 100-117

Hudson (1983) Theor Pop Biol 23: 183-201

Kelleher et al. (2016) PLoS Comp Biol 12: e1004842

Kelleher et al. (2019) Nat Genet 51: 1330-1338

Kuhner & Yamato (2017) J Mol Evol 84: 129-138

Minichiello & Durbin (2006) Am J Hum Genet 79: 910-922

Wiuf & Hein (1999) Theor Pop Biol 55: 248-259

How do different ARGs stack up?

Inheritance | ARG type | Local tree extraction | Recombination precision | Likelihood under CwR | “Knowable” nodes only
Direct | Full ARG (figs A, B) | Inefficient | Complete | Calculable | No
Sample-resolved | Full ARG (fig C) | Efficient | Imprecise for “trapped” non-ancestral breakpoints | Calculable | No
Sample-resolved | Coalescent ARG (fig D) | Efficient | Less detail | No | Yes
Sample-resolved | Interval tree ARG | Inefficient | Less detail | No | Yes

What ARGs can be represented in tskit?

Pros

  • Single node type
  • Explicit annotation of transmission allows any genetic inheritance model
  • Can be sample-resolved

Cons

  • Still inefficient to extract trees

Breakpoint location

Example ARGs based on Wiuf & Hein (1999); breakpoints at 2, 3, 4, & 6

Fig B.

Equivalent "genome" ARG

Fig C.

Sample-resolved full ARG

Fig D.

Coalescent ARG

Pros

  • Fastest tree extraction
  • Smaller & more efficient as only “knowable” nodes shown

Cons

  • Less precise information about recombination events (∴ cannot easily calculate CwR likelihood)

Affiliations & Acknowledgments

1 Big Data Institute, Li Ka Shing Centre for Health Information & Discovery, University of Oxford, UK

2 Department of Statistics, University of Oxford, UK

3 Department of Statistics, University of Warwick, UK

Thanks to our funders:

Y.Wo. & J.Ke. funded by the Robertson Foundation

A.Ig. funded by the Wellcome Trust

J.Ko. funded by EPSRC grant EP/V049208/1

Y. Wong1, A. Ignatieva2, J. Koskela3, J. Kelleher1

See more at https://tskit.dev