Efficient storage and analysis of ARGs using the tree sequence format
Node annotation versus edge annotation
Griffiths (1991) described an ARG format in which each “recombination node” is annotated with a single breakpoint (fig A). However, this format cannot easily represent inheritance processes such as gene conversion or multiple breakpoints. The equivalent tskit representation (fig B) annotates edges with intervals instead. Each interval represents a region of genome transmitted directly from parent to child, so arbitrary patterns of inheritance are easy to encode. Moreover, the intervals can be trimmed to trace only those regions that are inherited by the sampled genomes (a, b, and c): we call this a sample-resolved ARG.
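As a minimal sketch of edge annotation, the plain-Python example below stores each edge as a (left, right, parent, child) record and reads off the local tree at a given position. It does not use the tskit API: the `Edge` tuple, the `local_parents` helper, and the toy topology are all hypothetical and are not the ARG shown in the figures. The only assumption borrowed from tskit is the half-open interval convention [left, right).

```python
from collections import namedtuple

# Hypothetical toy data, not the ARG from the figures.
# Intervals are half-open [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Nodes 0, 1, 2 stand for samples a, b, c; nodes 3, 4, 5 are ancestors.
edges = [
    Edge(0, 10, 3, 0),   # a inherits its whole genome from node 3
    Edge(0, 3, 3, 1),    # b inherits [0, 3) from node 3 ...
    Edge(3, 5, 4, 1),    # ... a gene-conversion tract [3, 5) from node 4 ...
    Edge(5, 10, 3, 1),   # ... and [5, 10) from node 3 again
    Edge(0, 10, 4, 2),
    Edge(0, 10, 5, 3),
    Edge(0, 10, 5, 4),
]


def local_parents(edges, position):
    """Return {child: parent} for the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


for position in (1, 4, 7):
    print(position, local_parents(edges, position))
```

Sample b needs two breakpoints (at 3 and 5) to describe its inheritance, which three edge intervals capture directly, whereas a single-breakpoint node annotation would need extra nodes.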
Sample resolution and efficiency
Sample-resolved edge annotations (fig C) allow efficient extraction of local trees. In particular, the edges that change from one tree to the next can be identified without traversing the entire graph (Kelleher et al., 2016). This is vital for large ARGs: in our ARG of one million genomes from the UK Biobank, it takes only 11 seconds to iterate over 16 thousand trees on chromosome 20 (Kelleher et al., 2019). This incremental approach is key to tskit’s efficient analysis methods.
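The idea of finding only the edges that change between adjacent trees can be sketched in plain Python as follows. This is an illustrative reimplementation under simplifying assumptions, not tskit's own code (tskit offers comparable functionality through `TreeSequence.edge_diffs()`); the `Edge` tuple, the `edge_diffs` generator, and the toy edges are hypothetical. Edges are sorted once by left and once by right coordinate, and the two sorted lists are then swept in step.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Hypothetical toy ARG: sample 1 switches parent at positions 3 and 5.
SEQUENCE_LENGTH = 10
edges = [
    Edge(0, 10, 3, 0),
    Edge(0, 3, 3, 1),
    Edge(3, 5, 4, 1),
    Edge(5, 10, 3, 1),
    Edge(0, 10, 4, 2),
    Edge(0, 10, 5, 3),
    Edge(0, 10, 5, 4),
]


def edge_diffs(edges, sequence_length):
    """Yield ((left, right), edges_out, edges_in) for each local tree."""
    in_order = sorted(edges, key=lambda e: e.left)    # insertion order
    out_order = sorted(edges, key=lambda e: e.right)  # removal order
    i = j = 0
    left = 0
    while i < len(in_order) or left < sequence_length:
        edges_out = []
        while j < len(out_order) and out_order[j].right == left:
            edges_out.append(out_order[j])
            j += 1
        edges_in = []
        while i < len(in_order) and in_order[i].left == left:
            edges_in.append(in_order[i])
            i += 1
        # The tree ends where the next edge starts or the next edge stops.
        right = sequence_length
        if i < len(in_order):
            right = min(right, in_order[i].left)
        if j < len(out_order):
            right = min(right, out_order[j].right)
        yield (left, right), edges_out, edges_in
        left = right


# Maintain one parent map and update it in place from tree to tree.
parent = {}
trees = []
for interval, edges_out, edges_in in edge_diffs(edges, SEQUENCE_LENGTH):
    for e in edges_out:
        del parent[e.child]
    for e in edges_in:
        parent[e.child] = e.parent
    trees.append((interval, dict(parent)))
    print(interval, "removed", len(edges_out), "inserted", len(edges_in))
```

After the first tree is built, each transition in this toy example touches a single edge in and a single edge out, regardless of how many edges the graph contains.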
Converting a direct inheritance ARG (fig B) to the sample-resolved version (fig C) is equivalent to running Hudson's (1983) algorithm on a fixed ARG topology, for example using tskit’s ‘simplify’ algorithm, which is widely used in forward simulation.
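The interval-trimming step of this conversion can be sketched in plain Python. The sketch below is a simplified illustration, not tskit's ‘simplify’ implementation: `sample_resolve`, `clip`, `merge`, and the toy graph are hypothetical names and data, and node times are assumed to increase strictly from child to parent. It only propagates sample-ancestral regions upwards and clips each edge to them. It does not remove or renumber nodes, and it does not stop tracking a region once all samples have coalesced there.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Hypothetical direct-inheritance ARG. Node 2 is a recombinant ancestor of
# sample 1 with a breakpoint at 4; edges above nodes 3 and 4 span [0, 10).
SEQUENCE_LENGTH = 10
node_time = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4}
samples = [0, 1]
edges = [
    Edge(0, 10, 2, 1),
    Edge(0, 4, 3, 2),
    Edge(4, 10, 4, 2),
    Edge(0, 10, 5, 3),
    Edge(0, 10, 5, 0),
    Edge(0, 10, 6, 4),
    Edge(0, 10, 6, 5),
]


def clip(intervals, left, right):
    """Parts of `intervals` that fall inside [left, right)."""
    return [
        (max(a, left), min(b, right))
        for a, b in intervals
        if max(a, left) < min(b, right)
    ]


def merge(intervals):
    """Union of half-open intervals as a sorted, non-overlapping list."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def sample_resolve(edges, node_time, samples, sequence_length):
    """Clip every edge to the regions its child passes on to the samples."""
    ancestral = {u: [(0, sequence_length)] for u in samples}
    resolved = []
    # Visiting edges by child time handles every child before its parents.
    for e in sorted(edges, key=lambda e: node_time[e.child]):
        pieces = clip(ancestral.get(e.child, []), e.left, e.right)
        for a, b in pieces:
            resolved.append(Edge(a, b, e.parent, e.child))
        ancestral[e.parent] = merge(ancestral.get(e.parent, []) + pieces)
    return resolved


resolved = sample_resolve(edges, node_time, samples, SEQUENCE_LENGTH)
for e in resolved:
    print(e)
```

In this toy graph the edges above nodes 3 and 4 shrink from [0, 10) to [0, 4) and [4, 10) respectively, because those are the only regions that reach sample 1 through the recombinant node 2.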
Event ARGs versus Genome ARGs
The three node types in fig A correspond to different events (recombination, common ancestor, and sampling). In a tskit ARG, nodes instead represent genomes, so the edges connected to a single node can represent multiple events. In fig D, the recombination nodes involving breakpoints at positions 4 and 6 have been "simplified" away, with the breakpoint information retained on the edges above node b. The resulting genealogy is smaller and contains only “knowable” nodes (i.e. those directly inferable from the genetic sequence). This is important for ARG inference, where complete precision (e.g. the times of recombination events or the participating lineages) is not required. The same applies to coalescence: nodes can have more than two children (not shown).
Definitions
ARG: A graph structure representing genetic ancestry of a set of sampled genomes with recombination†
Tskit: The "tree sequence toolkit" – open-source software that can store different sorts of ARG in the succinct tree sequence format
Software that uses tskit to store ARGs
Native: msprime, SLiM, tsinfer, slendr, fwdpy11, Espalier
Import/Export: Relate, ARGinfer, ARG-Needle
Fig A. "Event" ARG or eARG
eARG = problematic format
Direct inheritance
Tskit format: gARG (nodes represent genomes), edges annotated
Sample-resolved inheritance
† As per Minichiello & Durbin (2006), we distinguish this from the neutral generating process of the CwR (coalescent with recombination)
Tskit traces genetic transmission by edge annotation, allowing flexible representation of different sorts of ARG, e.g.
Full ARG, direct inheritance (fig B): Equivalent to a "standard" node-annotated ARG (fig A), e.g. as proposed by Griffiths (1991)
Full ARG, sample-resolved inheritance (fig C): Proposed by Hudson (1983) and simulated e.g. using msprime's record_full_arg option
Coalescent ARG (fig D): Contains only nodes in which coalescence occurs (Kelleher et al., 2016), e.g. the default msprime output and the ARGs inferred by tsinfer
Interval tree ARG: Also known as an “interval-tree ARG” (Kuhner & Yamato, 2017). Only the sample nodes are fully shared between trees, which results in independent or semi-independent local trees, e.g. as inferred by Relate
References
Griffiths (1991) Proc Sheffield Symp Appl Prob 18: 100-117
Hudson (1983) Theor Pop Biol 23: 183-201
Kelleher et al. (2016) PLoS Comp Biol 12: e1004842
Kelleher et al. (2019) Nat Genet 51: 1330-1338
Kuhner & Yamato (2017) J Mol Evol 84: 129-138
Minichiello & Durbin (2006) Am J Hum Genet 79: 910-922
Wiuf & Hein (1999) Theor Pop Biol 55: 248-259
| ARG type | Local tree extraction | Recombination precision | Likelihood under CwR | “Knowable” nodes only |
| Full ARG, direct inheritance (figs A, B) | Inefficient | Complete | Calculable | No |
| Full ARG, sample-resolved inheritance (fig C) | Efficient | Imprecise for “trapped” non-ancestral breakpoints | Calculable | No |
| Coalescent ARG (fig D) | Efficient | Less detail | No | Yes |
| Interval tree ARG | Inefficient | Less detail | No | Yes |
How do different ARGs stack up?
What ARGs can be represented in tskit?
Breakpoint location
Example ARGs based on Wiuf & Hein (1999); breakpoints at 2, 3, 4, & 6
Fig B. Equivalent "genome" ARG
Fig C. Sample-resolved full ARG
Fig D. Coalescent ARG
Affiliations & Acknowledgments
1 Big Data Institute, Li Ka Shing Centre for Health Information & Discovery, University of Oxford, UK
2 Department of Statistics, University of Oxford, UK
3 Department of Statistics, University of Warwick, UK
Thanks to our funders:
Y.Wo. & J.Ke. funded by the Robertson Foundation
A.Ig. funded by the Wellcome Trust
J.Ko. funded by EPSRC grant EP/V049208/1
Y. Wong1, A. Ignatieva2, J. Koskela3, J. Kelleher1
See more at https://tskit.dev