Building, understanding, and using pangenomes
Erik Garrison
University of Tennessee (UTHSC), Memphis
@Workshop on Genomics, Český Krumlov
January 13, 2025
Key model: variation graph = nodes, edges, paths
Embeds genomes in a common sequence graph
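A minimal sketch of the model in code, assuming a toy graph whose node IDs, sequences, and path names are all made up for illustration. It handles forward-strand nodes only; real variation graphs also allow reverse-complement traversals.

```python
# Hypothetical toy variation graph: IDs, sequences, and path names are invented.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}      # node ID -> sequence
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}          # a SNP "bubble" (T vs. C)
paths = {"genome_a": [1, 2, 4], "genome_b": [1, 3, 4]}  # genomes as walks

def spell(path_name):
    """Return the genome sequence embedded as a path through the graph."""
    steps = paths[path_name]
    # Every consecutive pair of steps must be joined by an edge.
    for a, b in zip(steps, steps[1:]):
        if (a, b) not in edges:
            raise ValueError(f"path {path_name} uses missing edge {a}->{b}")
    return "".join(nodes[n] for n in steps)

print(spell("genome_a"))
print(spell("genome_b"))
```

Both genomes share nodes 1 and 4 and differ only in the middle node, which is how the graph embeds them in a common sequence space.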
New ideas often have a long history
St. Agatha pipetting a biosample into a nanopore sequencer
c. 1420
This all seems cool and “new” but ideas are rarely that.
Pangenomes and variation graphs have a long† history.
(†for genomics)
pipette
biosample
nanopore sequencer
nine versions of Valerio Magrelli’s
poem “Campagna Romana” (1981)
variation graphs: not a new idea!
Group B Streptococcus assemblies from 2002
Pangenome: not a new concept
First collections of multiple genomes from the same species demonstrated substantial differences.
This was unexpected and required new theory to understand.
A single reference is not enough to explain genomic diversity. Even many genomes may not be enough.
Some genes are shared among all individuals: these are “core”, while others are not—we call them “accessory”.
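A minimal sketch of the core/accessory split, assuming each genome is summarized as a set of gene names. The strain and gene names below are hypothetical example data, not results from the talk.

```python
def split_core_accessory(gene_sets):
    """Split a pangenome into core genes (in every genome) and accessory genes."""
    genomes = list(gene_sets.values())
    pan = set().union(*genomes)          # every gene seen in any genome
    core = set.intersection(*genomes)    # genes present in all genomes
    return core, pan - core

# Hypothetical gene content for three strains.
gene_sets = {
    "strain_1": {"gyrA", "recA", "cps1"},
    "strain_2": {"gyrA", "recA"},
    "strain_3": {"gyrA", "recA", "tetM"},
}
core, accessory = split_core_accessory(gene_sets)
print(sorted(core), sorted(accessory))
```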
Lessons from language modeling: Heaps’ law
A pangenome is:
Closed: our observations of new genes with new genomes diminish.
Open: we continue to see new genes as we add more genomes.
The exponent α determines whether the pangenome is open (α ≤ 1) or closed (α > 1). The top panel shows data for an open pangenome species, P. marinus; the bottom panel for a closed pangenome species, S. aureus
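A sketch of how α could be estimated, assuming the commonly used power-law form n(N) = κ·N^(-α) for the number of new genes n found when the N-th genome is added. The slide only gives the thresholds on α, so the functional form, the function names, and the synthetic counts are assumptions for illustration.

```python
import math

def fit_power_law(new_genes):
    """Least-squares fit of log n = log(kappa) - alpha * log N.

    new_genes[i] is the number of new genes seen when genome i+1 is added.
    Returns (kappa, alpha).
    """
    points = [(math.log(N), math.log(n))
              for N, n in enumerate(new_genes, start=1) if n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope

def classify(alpha):
    """Apply the open/closed rule stated on the slide."""
    return "open" if alpha <= 1 else "closed"

# Synthetic counts generated with kappa = 100 and alpha = 0.5 (not real data).
counts = [100 * N ** -0.5 for N in range(1, 21)]
kappa, alpha = fit_power_law(counts)
print(round(kappa, 3), round(alpha, 3), classify(alpha))
```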
Pangenome research timeline
2000-2010s: counting genes
~2015: let’s take it to the sequence level (genome graphs)
2020s: complete assemblies (T2T pangenomes)
github.com/vgteam/vg
“Computational pan-genomics: status, promises and challenges” https://doi.org/10.1093/bib/bbw089
yup… we can generalize most standard bioinformatic algorithms to graphs, as in Partial Order Alignment →
Wait! You can align sequences to graphs?
reads aligned to a variation graph
vg resolves reference bias at known indels in HG002
50x 2x150bp Illumina sequencing of HG002
Using variation graphs to observe CCR5-delta in ancient samples
Rui Martiniano
Exonic deletion in the HGSVC dataset correctly genotyped by vg
reads vs. graph
reads vs. GRCh38
vg giraffe: approach
vg giraffe is accurate enough
vg giraffe is very fast
vg giraffe improves variant calling
vg giraffe lets us scale: PCA from SVs in 5k genomes
Building a draft human pangenome
Q1 2025
Sample selection was constrained by:
Draft pangenome composition
Amazing assemblies approach reference quality
Haplotype-resolved assemblies from trio-hifiasm.
They are really good, according to realignment of reads to the assemblies and a model of assembly completeness: nearly as good as T2T-CHM13!
Mobin Asri
Then we made 5 pangenome (reference) graphs…
Minigraph
Minigraph-Cactus
Key conceptual differences between HPRC pangenome construction methods
minigraph: just SVs, no complex stuff, one reference.
minigraph-cactus: add SNPs, clean up the breakpoints, useful for alignment, one reference.
pggb: everything-vs-everything, hard to align to, useful for studying evolution and pangenome structure at all scales, all genomes are references.
“Collapse” in high-copy repeats →
Minigraph-Cactus creates a hierarchical pangenome rooted in the reference genome, ensuring compatibility with standard tools.
PGGB creates graphs in which each genome can act as a reference, so we choose our reference as needed by later analysis or work in graph space.
Why do we need a human pangenome?
T2T-CHM13 adds ~200 Mb of heterochromatin to the reference.
The draft pangenome adds ~100 Mb of polymorphic euchromatin (and a lot more heterochromatin),
including 0.6-4.4 Mb of additional genic sequence per haplotype compared to GRCh38 (38 gene CNVs/haplotype).
Pangenome growth (PGGB)
PGGB: all chromosomes, layout with path-guided SGD
Simon Heumos
C4A/B in pggb graph
MHC class II
6p
MHC
centromere
Christian Fischer (UTHSC)
GenBank annotations on top of the pggb HPRC graph.
C4A/B
copy number is related to schizophrenia risk
We learn that genome evolution is often nonlinear
complement component 4 locus
Large SVs predominantly occur at VNTRs, which are simply loops in our pggb graphs.
C4A
C4A
C4B
C4B
C4B
Amylase: how humans evolved a taste for agriculture
Amylase breaks starch into simpler sugars
Amylase is a multi-copy gene family
Human amylase copy number diversity
28 amylase structural haplotypes
11 consensus structures
Deconvolving haplotypes from short reads
Recent evolution from 533 ancient European genomes
Data from Allentoft et al. 2024, Nature
Evidence for selection of high-copy amylase haplotypes
Allele frequency changes over the last ~12,000 years in Eurasian populations.
Recombination between heterologous acrocentric chromosomes
Workflow
HPRC assemblies
Mapping against the whole CHM13
PanGenome Graph Builder (PGGB)
Acrocentric contigs covering (+/- 1Mbp) both the p and q arms (pq-contigs)
+ HG002 contigs >= 300 kbp which map to acrocentrics
Untangling the pangenome graph
We look into the graph from the perspective of chromosome 13. Full information from pangenome plus reference annotations.
Pseudo-homologous regions
14
21
FISH probes from GRC clones
Complete sequence of Robertsonian chromosomes
de Lima*, Guarracino* et al., in preparation
GM03786
GM04890
t(13;14)
t(14;21)
GM03417
Pseudo-homologous regions (PHRs)
Why was it not possible to see this before using traditional laboratory techniques?
← 300 FISHing experiments to find one heterologous synapse.
Implicit pangenomes
Heumos et al. https://doi.org/10.1093/bioinformatics/btae363
Pangenome graph of the HPRCy1 including 88 haplotypes.
Represents all variation of all types between all genomes.
Pangenome graphs = Pangenome alignments
Alignments can be represented as sets of matches, which are just pairs of positions in two genomes.
Garrison & Guarracino 2022 https://doi.org/10.1093/bioinformatics/btac743
To build the graph, we condense matched genome letters into single nodes.
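A minimal sketch of that condensation step, assuming forward-strand exact matches given as hypothetical tuples `(name_a, start_a, name_b, start_b, length)`. It uses a union-find over all genome positions and emits one node per base; this illustrates the idea only and is not how seqwish is implemented (seqwish also compacts nodes and handles reverse-complement matches).

```python
def build_graph(seqs, matches):
    """Condense matched letters into shared nodes; return (node_bases, edges, paths)."""
    offset, total = {}, 0
    for name, seq in seqs.items():          # give every letter a global index
        offset[name] = total
        total += len(seq)
    parent = list(range(total))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for name_a, start_a, name_b, start_b, length in matches:
        for k in range(length):
            if seqs[name_a][start_a + k] != seqs[name_b][start_b + k]:
                raise ValueError("matches must pair identical letters")
            ra = find(offset[name_a] + start_a + k)
            rb = find(offset[name_b] + start_b + k)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    node_of, node_bases, edges, paths = {}, [], set(), {}
    for name, seq in seqs.items():
        steps = []
        for i, base in enumerate(seq):
            root = find(offset[name] + i)
            if root not in node_of:         # first visit creates the node
                node_of[root] = len(node_bases)
                node_bases.append(base)
            steps.append(node_of[root])
        paths[name] = steps
        edges.update(zip(steps, steps[1:]))
    return node_bases, edges, paths

# Hypothetical input: two genomes that differ by one SNP.
seqs = {"g1": "ACGT", "g2": "AGGT"}
matches = [("g1", 0, "g2", 0, 1), ("g1", 2, "g2", 2, 2)]
node_bases, edges, paths = build_graph(seqs, matches)
print(node_bases, sorted(edges), paths)
```

Because matched letters are merged transitively, the result does not depend on which genome is treated as first, which is the sense in which the graph and the alignment set carry the same information.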
The variation graph represents all alignments and all genomes.
Garrison & Guarracino 2022 https://doi.org/10.1093/bioinformatics/btac743
Is this the end of genomics?
One model:
GENOME
ALL
Y’ALL
But… there are problems…
1.05 Terabytes
One chromosome.
You’re not alone agolicz.
You’re not alone Wwwwwwwyc.
😭😭😭
Don’t cry, wjwei-handsome
Building pangenome graphs is hard…
Why are graphs hard?
Graphs are “sticky”...
Anything can connect to anything else.
Nonlinear relationships = you never know what you need to know.
That means you need to remember everything, all at once.
==
S. cerevisiae pangenome graph
graph adjacency matrix
(node:node edges)
from Yue et al. 2017 https://doi.org/10.1038/ng.3847
We cheated when we made the HPRCy1 graphs.
Yup.
We broke it up by chromosome.
🙈
🙊
Why does it matter if we cheated?
Guarracino et al. 2023 is why.
Heterologous acrocentric chromosomes recombine.
Recombination occurs here
Let’s overcomplicate things
alignments → pangenome graph → … “untangling”
from Guarracino et al. 2023.
alignments → pangenome graph → untangling means … alignments?
The graph vanishes
alignments → pangenome graph (X) → … alignments?
Implicit data structures
Useful pangenome queries:
But, we want to avoid building the whole graph up front to get these.
We can obtain these with alignments and an implicit representation of the graph.
Implicit interval tree
100 150
130 200
170 300
180 250
200 250
250 350
270 300
300 320
350 450
390 420
==
IIT
IMPG: IMplicit Pangenome Graphs
Alignment [start, end) → intervals → implicit interval tree (per genome)
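A sketch of the overlap query such an index has to answer, using the example intervals from the slide. For brevity this is a simpler stand-in, not a true implicit interval tree: a sorted array plus running maxima of the end coordinates, with half-open [start, end) intervals. The class and method names are hypothetical.

```python
from bisect import bisect_left

class IntervalIndex:
    """Sorted-array overlap index over half-open [start, end) intervals."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [s for s, _ in self.intervals]
        # max_end[i] = largest end among intervals[0..i]
        self.max_end = []
        running = float("-inf")
        for _, e in self.intervals:
            running = max(running, e)
            self.max_end.append(running)

    def overlaps(self, q_start, q_end):
        """Return all stored intervals overlapping [q_start, q_end)."""
        hits = []
        # Only intervals that start before q_end can overlap the query.
        i = bisect_left(self.starts, q_end) - 1
        # Walk left until no earlier interval can reach past q_start.
        while i >= 0 and self.max_end[i] > q_start:
            s, e = self.intervals[i]
            if e > q_start:
                hits.append((s, e))
            i -= 1
        hits.reverse()
        return hits

index = IntervalIndex([(100, 150), (130, 200), (170, 300), (180, 250), (200, 250),
                       (250, 350), (270, 300), (300, 320), (350, 450), (390, 420)])
print(index.overlaps(260, 280))
```

An implicit interval tree answers the same query with better worst-case bounds by storing the maxima in an implicit binary-tree layout over the same sorted array.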
wfmash
Querying impg
The result is the “transitive closure” of aligned ranges in the pangenome. If graph == alignments, this is a subgraph query.
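A sketch of a transitive-closure query, under strong simplifying assumptions: alignments are ungapped, forward-strand, equal-length range pairs given as hypothetical tuples `(query, q_start, q_end, target, t_start, t_end)`, and they are scanned linearly rather than through an interval tree. This is not the impg implementation, which projects through CIGAR strings.

```python
from collections import deque

def transitive_query(alignments, name, start, end):
    """Collect every range reachable from name:[start, end) through alignments."""
    seen = {(name, start, end)}
    queue = deque(seen)
    while queue:
        cur_name, cur_start, cur_end = queue.popleft()
        for qn, qs, qe, tn, ts, te in alignments:
            # Try each alignment in both directions (query->target, target->query).
            for a_name, a_start, a_end, b_name, b_start in (
                    (qn, qs, qe, tn, ts), (tn, ts, te, qn, qs)):
                if a_name != cur_name:
                    continue
                lo, hi = max(cur_start, a_start), min(cur_end, a_end)
                if lo >= hi:
                    continue                      # no overlap with this alignment
                hit = (b_name, b_start + lo - a_start, b_start + hi - a_start)
                if hit not in seen:
                    seen.add(hit)
                    queue.append(hit)
    # Merge overlapping ranges per genome for a compact answer.
    merged = {}
    for n, s, e in sorted(seen):
        runs = merged.setdefault(n, [])
        if runs and s <= runs[-1][1]:
            runs[-1][1] = max(runs[-1][1], e)
        else:
            runs.append([s, e])
    return {n: [tuple(r) for r in runs] for n, runs in merged.items()}

# Hypothetical alignments: g2 vs. g1, and g3 vs. g2 (g3 is never aligned to g1).
alignments = [("g2", 0, 100, "g1", 50, 150), ("g3", 10, 60, "g2", 20, 70)]
result = transitive_query(alignments, "g1", 60, 80)
print(result)
```

The query reaches g3 even though it was never aligned directly to g1, which is what makes the result behave like a subgraph extraction.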
Telomere-to-telomere great ape assemblies
Homo sapiens (human)
Symphalangus syndactylus (siamang gibbon)
Gorilla gorilla (gorilla)
Pongo pygmaeus (Bornean orangutan)
Pongo abelii (Sumatran orangutan)
Pan troglodytes (chimpanzee)
Pan paniscus (bonobo)
phylogeny from PGGB graph on hsa_chr6
1 graph.
35 days.
job failed
New implicit graph approach.
Align all-vs-all
→ impg index (subgraphs)
→ alignment-based analysis (MAF)
MHC queried from impg
Polymorphic inversion on human 8p23.1
inversion
5’ flank
3’ flank
defensin tangle
Conservation scores show “hotspots” of fast evolution
Conservation scores broadly reflect constraint
ILS: Chr 6 (chm13 ref frame)
PanTro3 (P), PanPan1 (P), Hg002 (P), GorGor1 (1), PonPyg2 (1), PonAbe1 (1)
O, G, H, CB
Dots = mean quartet voting for 100 kb window
Lines = sliding mean 3 Mb
Ref length: 172,126,629
QI-sites: 616,403 (around purple)
ILS: Chr 6 (mPonAbe1 ref frame)
PanTro3 (P), PanPan1 (P), Hg002 (P), GorGor1 (1), PonPyg2 (1), PonAbe1 (1)
O, G, H, CB
Dots = mean quartet voting for 100 kb window
Lines = sliding mean 3 Mb
Ref length: 172,605,364
QI-sites: 634,168 (17,765 more)
Andrea Guarracino (pggb, wfmash, seqwish, odgi, chromosome communities)
Simon Heumos (pggb, odgi)
Flavia Villani (pggb, applications to mouse, popgen)
Njagi Mwaniki (wfmash, WFA applications)
Santiago Marco-Sola (WFA, wfmash)
Pjotr Prins (guidance, vcflib, vcfwave)
Richard Durbin (guidance)
Nicole Soranzo (guidance, support)
Benedict Paten (vgteam)
Hao Chen (rat, mouse)
Zhigui Bao (applications)
Lorenzo Tattini (yeast pangenomes)
Enza Colonna (applications to mouse, popgen)
Nadia Pisanti (algorithms)
Luca Pinello (applications)
Jennifer Gerton (robertsonians)
Adam Phillippy (robertsonians)
Peter Sudmant (primate pangenomes)
Robert Williams (guidance)
HPRC pangenomes working group and many others
funders include
Tennessee, NSF, NIH
Practical!
Let’s build some pangenome variation graphs with pggb!
First: a deeper dive into how the method works.
Then: we’ll work through small examples to learn how to drive it.
PanGenome Graph Builder
wfmash (biWFA)
seqwish (unbiased graph builder)
smoothxg (graph normalization)
wfmash makes initial alignments
the wavefront algorithm (WFA)
Needleman-Wunsch
high-order WFA (WF-lambda)
high-order bidirectional WFA (BiWFλ)
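A sketch contrasting the two approaches named above on the simplest scoring model, unit-cost edit distance. This is an illustration only: wfmash uses gap-affine and bidirectional WFA variants, and the function names here are hypothetical. Needleman-Wunsch fills the whole dynamic-programming matrix, while the wavefront version only tracks the furthest-reaching point per diagonal for each score.

```python
def nw_edit_distance(a, b):
    """Full dynamic programming: O(len(a) * len(b)) cells."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[len(b)]

def wfa_edit_distance(a, b):
    """Wavefront version: wave[k] = furthest i on diagonal k = j - i."""
    n, m = len(a), len(b)

    def extend(i, k):
        # Slide along diagonal k for free while letters match.
        while i < n and i + k < m and a[i] == b[i + k]:
            i += 1
        return i

    final_k = m - n
    wave = {0: extend(0, 0)}
    score = 0
    while wave.get(final_k, -1) < n:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            candidates = []
            if k in wave:
                candidates.append(wave[k] + 1)      # mismatch
            if k - 1 in wave:
                candidates.append(wave[k - 1])      # consume a letter of b
            if k + 1 in wave:
                candidates.append(wave[k + 1] + 1)  # consume a letter of a
            valid = [i for i in candidates if i <= n and i + k <= m]
            if valid:
                nxt[k] = extend(max(valid), k)
        wave = nxt
    return score

print(nw_edit_distance("kitten", "sitting"), wfa_edit_distance("kitten", "sitting"))
```

The wavefront work grows with the alignment score rather than with the product of the sequence lengths, which is why it suits highly similar genomes.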
seqwish builds the graph
Path-guided stochastic gradient descent algorithm to optimize 1D order to best-match positions in embedded paths.
Pangenome graph with 12 ALT sequences of the HLA-DRB1 gene from the GRCh38 reference genome.
smoothxg organizes & normalizes the graph
… then we run MSA over this, locally
odgi helps us understand the pangenome
identity
position
orientation
copy number variation
2d layout
ODGI is meant to be a basic toolkit for interacting with pangenome graphs.
It uses the embedded genomes as references.
Putting it all together!
Test material today
Example: yeast chromosome 6
Yue, JX., Li, J., Aigrain, L. et al. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nat Genet 49, 913–924 (2017). https://doi.org/10.1038/ng.3847
pafplot of initial alignments from wfmash
a bit of the 2D layout
you have to zoom in in the web browser
path view
position view
inversion view
copy number view
TBC1D3
for Evan Eichler
chr17 in PGGB HPRCv1
sequence clipped by minigraph-cactus
annotation of clipped sequences in minigraph-cactus for A. thaliana pangenome