Efficient storage and analysis of ARGs using the tree sequence format

Node annotation versus edge annotation

Griffiths (1991) described an ARG format where each “recombination node” is annotated with a single breakpoint (fig A). However, this cannot easily represent inheritance processes such as gene conversion or multiple breakpoints. The equivalent tskit representation (fig B) annotates edges with intervals. Intervals represent regions of genome directly transmitted between parent and child and can easily allow for arbitrary inheritance. Moreover, they can be modified to trace only those regions that are inherited by the sampled genomes (a, b, and c): we call this a sample-resolved ARG.
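To make the contrast concrete, here is a minimal sketch in plain Python of edges annotated with half-open genomic intervals. It is not the tskit API: the `Edge` record and the `inherited_from` helper are hypothetical names, and the node labels and coordinates are invented for illustration rather than taken from fig B. A child that inherits a central tract from a second parent, as in gene conversion, needs two breakpoints, which a single-breakpoint node annotation cannot hold but three edge records can.

```python
from collections import namedtuple

# Hypothetical record: `child` inherits the half-open interval
# [left, right) directly from `parent`.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Illustrative genome of length 10. Child "x" inherits [0, 3) and [5, 10)
# from parent "p", and the tract [3, 5) from parent "q".
edges = [
    Edge(0, 3, "p", "x"),
    Edge(3, 5, "q", "x"),
    Edge(5, 10, "p", "x"),
]


def inherited_from(edges, child, position):
    """Return the parent from which `child` inherits `position`, or None."""
    for e in edges:
        if e.child == child and e.left <= position < e.right:
            return e.parent
    return None


# Parent of "x" at each integer position along the genome.
parents_along_genome = [inherited_from(edges, "x", pos) for pos in range(10)]
print(parents_along_genome)
```

Any inheritance pattern that can be written as a set of (interval, parent, child) records fits this scheme, which is the sense in which edge annotation allows arbitrary inheritance.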

Sample resolution and efficiency

Sample-resolved edge annotations (fig C) allow efficient extraction of local trees. In particular, it is easy to figure out which edges change from one tree to the next without traversing the entire graph (Kelleher et al., 2016). This is vital for large ARGs: in our ARG of one million genomes from the UK Biobank, it takes only 11 seconds to iterate over 16 thousand trees on chromosome 20 (Kelleher et al., 2019). This is key to tskit’s efficient analysis methods.
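The edge-difference idea can be sketched in plain Python as follows. This is an illustrative reading of the approach in Kelleher et al. (2016), not tskit's implementation: the function name `local_trees`, the tuple layout, and the example nodes are assumptions. Edges are indexed once by left and by right coordinate; moving to the next tree then only touches the edges that end or start at the boundary.

```python
def local_trees(edges, sequence_length):
    """Yield (left, right, parent_of) for each local tree.

    `edges` is a list of (left, right, parent, child) tuples with half-open
    intervals; `parent_of` maps child -> parent in the current tree.
    """
    by_left = sorted(edges, key=lambda e: e[0])   # order of insertion
    by_right = sorted(edges, key=lambda e: e[1])  # order of removal
    parent_of = {}
    i = j = 0
    left = 0
    while left < sequence_length:
        # Remove the edges that end at this boundary...
        while j < len(by_right) and by_right[j][1] == left:
            del parent_of[by_right[j][3]]
            j += 1
        # ...then insert the edges that start here.
        while i < len(by_left) and by_left[i][0] == left:
            _, _, parent, child = by_left[i]
            parent_of[child] = parent
            i += 1
        # The tree extends to the next coordinate at which any edge changes.
        right = sequence_length
        if i < len(by_left):
            right = min(right, by_left[i][0])
        if j < len(by_right):
            right = min(right, by_right[j][1])
        # The copy is for demonstration only; an efficient consumer would
        # read the incrementally updated `parent_of` in place.
        yield left, right, dict(parent_of)
        left = right


# Illustrative edges: samples a, b, c; the root changes from e to g at 5.
example_edges = [
    (0, 10, "d", "a"),
    (0, 10, "d", "b"),
    (0, 5, "e", "d"),
    (0, 5, "e", "c"),
    (5, 10, "g", "d"),
    (5, 10, "g", "c"),
]
trees = list(local_trees(example_edges, 10))
for left, right, parent_of in trees:
    print(left, right, parent_of)
```

In this example the edges above a and b persist across the breakpoint, so only two edges are removed and two inserted when moving to the second tree.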

Converting a direct inheritance ARG (fig B) to the sample-resolved version (fig C) is equivalent to running Hudson's (1983) algorithm on a fixed ARG topology, e.g. by using tskit’s ‘simplify’ algorithm, which is widely used in forward simulations.
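The core of this conversion, restricting each edge to the regions inherited by at least one sample, can be sketched in plain Python. This is a deliberately reduced illustration and not tskit's ‘simplify’ implementation: it only trims edge intervals and drops empty edges, without removing or renumbering nodes. The function names, node labels, and node times are hypothetical.

```python
def intersect(intervals, left, right):
    """Clip a list of half-open intervals to [left, right)."""
    out = []
    for a, b in intervals:
        lo, hi = max(a, left), min(b, right)
        if lo < hi:
            out.append((lo, hi))
    return out


def merge(intervals):
    """Union of half-open intervals as a sorted, non-overlapping list."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def sample_resolve(edges, node_time, samples, sequence_length):
    """Trim each edge to the regions ancestral to at least one sample."""
    ancestral = {s: [(0, sequence_length)] for s in samples}
    resolved = []
    # Visit edges from the youngest child upwards, so that a child's
    # ancestral material is complete before edges above it are processed.
    for left, right, parent, child in sorted(edges, key=lambda e: node_time[e[3]]):
        passed_up = intersect(ancestral.get(child, []), left, right)
        for lo, hi in passed_up:
            resolved.append((lo, hi, parent, child))
        ancestral[parent] = merge(ancestral.get(parent, []) + passed_up)
    return resolved


# Illustrative direct-inheritance ARG: sample a recombines at position 4,
# and the edges above r1 and r2 are annotated with the whole genome.
node_time = {"a": 0, "b": 0, "r1": 1, "r2": 1, "u": 2, "v": 3}
direct_edges = [
    (0, 4, "r1", "a"),
    (4, 10, "r2", "a"),
    (0, 10, "u", "b"),
    (0, 10, "u", "r1"),
    (0, 10, "v", "r2"),
    (0, 10, "v", "u"),
]
resolved = sample_resolve(direct_edges, node_time, ["a", "b"], 10)
for edge in resolved:
    print(edge)
```

After resolution, the edge above r1 covers only [0, 4) and the edge above r2 only [4, 10), because those are the only regions that reach a sample through them.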

Event ARGs versus Genome ARGs

The three node types in fig A correspond to different events (recombination, common ancestor, and sampling). In a tskit ARG, nodes represent genomes. Edges connected to a single node can then represent multiple events. In fig D, the recombination nodes involving breakpoints at positions 4 and 6 have been "simplified" away, with breakpoint information retained on edges above node b. The genealogy is smaller and contains only “knowable” nodes (i.e. those directly inferable from the genetic sequence). This is important for ARG inference, as complete precision (e.g. times of recombination or participating lineages) is not required. It also applies to coalescence: nodes can have >2 children (not shown).

Definitions

ARG: A graph structure representing genetic ancestry of a set of sampled genomes with recombination^†

Tskit: The "tree sequence toolkit", open-source software that can store different sorts of ARG in the succinct tree sequence format

Software that uses tskit to store ARGs

Native: msprime, SLiM, tsinfer, slendr, fwdpy11, Espalier

Import/Export: Relate, ARGinfer, ARG-Needle

Pros

Simple notation
Easy likelihood calculation

Cons

Tied to events in generative model (e.g. no gene conversion)
Inefficient to extract local trees
L+R edges over a RE node need distinguishing e.g. by edge order ∴ hard to simulate

Fig A.

"Event" ARG or eARG

eARG = problematic format

Direct inheritance

Pros

Fast tree extraction
CwR likelihood still calculable
Edge spans reflect transmission to samples (shown e.g. in line width)
Can be simplified to remove nodes involved in recombination & common ancestors with no genetic coalescence

Cons

Loses exact location of trapped breakpoints (e.g. 3)

Tskit format: gARG (nodes represent genomes), edges annotated

Sample-resolved inheritance

^† as per Minichiello & Durbin (2006) we distinguish this from the neutral generating process of the CwR (coalescent-with-recombination)

Tskit traces genetic transmission by edge annotation, allowing flexible representation of different sorts of ARG, e.g.

ARGs with direct inheritance (fig B)

Equivalent to a "standard" node-annotated ARG (fig A) as e.g. proposed by Griffiths (1991)

ARGs with sample-resolved inheritance

"Full" ARGs (fig C)

Proposed by Hudson (1983) and e.g. simulated using msprime's record_full_arg option

"Coalescent" ARGs (fig D)

Containing only nodes in which coalescence occurs (Kelleher et al., 2016), e.g. the default msprime output and the ARGs inferred by tsinfer

"Tree-by-tree" ARGs (not shown)

Also known as an “interval-tree ARG” (Kuhner & Yamato, 2017): only the sample nodes are fully shared between trees, which results in independent or semi-independent local trees, e.g. as inferred by Relate

References

Griffiths (1991) Proc Sheffield Symp Appl Prob 18: 100-117

Hudson (1983) Theor Pop Biol 23: 183-201

Kelleher et al. (2016) PLoS Comp Biol 12: e1004842

Kelleher et al. (2019) Nat Genet 51: 1330-1338

Kuhner & Yamato (2017) J Mol Evol 84: 129-138

Minichiello & Durbin (2006) Am J Hum Genet 79: 910-922

Wiuf & Hein (1999) Theor Pop Biol 55: 248-259

Inheritance	ARG type	Local tree extraction	Recombination precision	Likelihood under CwR	“Knowable” nodes only
Direct inheritance	Full ARG (figs A, B)	Inefficient	Complete	Calculable	No
Sample-resolved inheritance	Full ARG (fig C)	Efficient	Imprecise for “trapped” non-ancestral breakpoints	Calculable	No
Sample-resolved inheritance	Coalescent ARG (fig D)	Efficient	Less detail	No	Yes
Sample-resolved inheritance	Interval tree ARG	Inefficient	Less detail	No	Yes

How do different ARGs stack up?

What ARGs can be represented in tskit?

Pros

Single node type
Explicit annotation of transmission allows any genetic inheritance model
Can be sample resolved

Cons

Still inefficient to extract trees

Breakpoint location

Example ARGs based on Wiuf & Hein (1999); breakpoints at 2, 3, 4, & 6

Fig B.

Equivalent "genome" ARG

Fig C.

Sample-resolved full ARG

Fig D.

Coalescent ARG

Pros

Fastest tree extraction
Smaller & more efficient as only “knowable” nodes shown

Cons

Less precise information about recombination events (∴ cannot easily calculate CwR likelihood)

Affiliations & Acknowledgments

¹Big Data Institute, Li Ka Shing Centre for Health Information & Discovery, University of Oxford, UK

²Department of Statistics, University of Oxford, UK

³Department of Statistics, University of Warwick, UK

Thanks to our funders:

Y.Wo. & J.Ke. funded by the Robertson Foundation

A.Ig. funded by the Wellcome Trust

J.Ko. funded by EPSRC grant EP/V049208/1

Y. Wong¹, A. Ignatieva², J. Koskela³, J. Kelleher¹

See more at https://tskit.dev