1 of 136

Building, understanding, and using pangenomes

Erik Garrison

University of Tennessee (UTHSC), Memphis

@Workshop on Genomics, Český Krumlov

January 13, 2025

2 of 136

Key model: variation graph = nodes, edges, paths

3 of 136

Key model: variation graph = nodes, edges, paths

Embeds genomes in a common sequence graph
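
A minimal sketch of the model, assuming forward-strand nodes only (real variation graphs also track node orientation). The class and the toy sequences are illustrative, not the vg or odgi data structures.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    nodes: dict = field(default_factory=dict)  # node id -> DNA sequence
    edges: set = field(default_factory=set)    # (from node id, to node id)
    paths: dict = field(default_factory=dict)  # genome name -> list of node ids

    def add_path(self, name, node_ids):
        """Embed a genome as a walk and record the edges that the walk uses."""
        self.paths[name] = list(node_ids)
        self.edges.update(zip(node_ids, node_ids[1:]))

    def sequence(self, name):
        """Spell a genome by concatenating node sequences along its path."""
        return "".join(self.nodes[n] for n in self.paths[name])


# Toy example (made-up sequences): two genomes that differ by one SNP.
graph = VariationGraph(nodes={1: "ACG", 2: "T", 3: "C", 4: "GGA"})
graph.add_path("ref", [1, 2, 4])
graph.add_path("alt", [1, 3, 4])
```

Both genomes share nodes 1 and 4 and diverge at nodes 2 and 3, so the SNP shows up as a small bubble in the graph.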

4 of 136

New ideas often have a long history

St. Agatha pipetting a biosample into a nanopore sequencer

c. 1420

This all seems cool and “new” but ideas are rarely that.

Pangenomes and variation graphs have a long† history.

(† for genomics)

5 of 136

pipette

biosample

nanopore sequencer

6 of 136

nine versions of Valerio Magrelli’s poem “Campagna Romana” (1981)

variation graphs: not a new idea!

7 of 136

https://doi.org/10.1007/978-3-030-38281-0

Group B Streptococcus assemblies from 2002

Pangenome: not a new concept

The first collections of multiple genomes from the same species revealed substantial differences.

This was unexpected and required new theory to understand.

A single reference is not enough to explain genomic diversity. Even many genomes may not be enough.

Some genes are shared among all individuals: these are “core”, while others are not—we call them “accessory”.
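
The core/accessory split can be sketched with plain set operations. This assumes each genome has already been reduced to a set of gene family identifiers; the strain names and gene sets below are toy values.

```python
def split_core_accessory(gene_sets):
    """gene_sets: dict mapping genome name -> set of gene family ids.

    Returns (core, accessory): genes present in every genome, and genes
    present in at least one genome but not in all of them.
    """
    all_sets = list(gene_sets.values())
    pan = set().union(*all_sets)
    core = set.intersection(*all_sets) if all_sets else set()
    return core, pan - core


# Toy input (hypothetical strains and gene content).
genomes = {
    "strainA": {"dnaA", "gyrB", "capX"},
    "strainB": {"dnaA", "gyrB", "pilY"},
    "strainC": {"dnaA", "gyrB"},
}
core, accessory = split_core_accessory(genomes)
```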

8 of 136

Lessons from language modeling: Heaps’ law

A pangenome is:

Closed: our observations of new genes with new genomes diminish.

Open: we continue to see new genes as we add more genomes.

https://doi.org/10.1007/978-3-030-38281-0

The exponent α determines whether the pangenome is open (α ≤ 1) or closed (α > 1). The top panel shows data for an open pangenome species, P. marinus; the bottom panel for a closed pangenome species, S. aureus
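
A hedged sketch of how α could be estimated: assume the number of new genes n found when the N-th genome is added follows n = κ·N^(−α), and fit a straight line in log-log space. The function names and the synthetic data are assumptions for illustration; published analyses use more careful fitting and average over genome orderings.

```python
import math


def fit_power_law(points):
    """Least-squares fit of log n = log(kappa) - alpha * log N.

    points: iterable of (N, n) with N >= 1 genomes and n > 0 new genes.
    Returns (kappa, alpha).
    """
    xs = [math.log(N) for N, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope


def classify(alpha):
    """Open if alpha <= 1, closed otherwise (the convention stated above)."""
    return "open" if alpha <= 1 else "closed"


# Synthetic data generated from known parameters, not real measurements.
open_points = [(N, 100 * N ** -0.5) for N in range(2, 21)]
closed_points = [(N, 100 * N ** -2.0) for N in range(2, 21)]
```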

9 of 136

Pangenome research timeline

2000s-2010s: counting genes

~2015: let’s take it to the sequence level (genome graphs)

2020s: complete assemblies (T2T pangenomes)

10 of 136

github.com/vgteam/vg

“Computational pan-genomics: status, promises and challenges” https://doi.org/10.1093/bib/bbw089

11 of 136

Wait! You can align sequences to graphs?

Yup… we can generalize most standard bioinformatic algorithms to graphs, as in Partial Order Alignment →

https://doi.org/10.1093/bioinformatics/18.3.452

12 of 136

reads aligned to a variation graph

13 of 136

14 of 136

vg resolves reference bias at known indels in HG002

50x 2x150bp Illumina sequencing of HG002

15 of 136

16 of 136

Using variation graphs to observe CCR5-delta in ancient samples

Rui Martiniano

17 of 136

18 of 136

Exonic deletion in the HGSVC dataset correctly genotyped by vg

reads vs. graph

reads vs. GRCh38

19 of 136

20 of 136

vg giraffe: approach

21 of 136

vg giraffe is accurate enough

22 of 136

vg giraffe is very fast

23 of 136

vg giraffe improves variant calling

24 of 136

vg giraffe lets us scale: PCA from SVs in 5k genomes

25 of 136

Building a draft human pangenome

26 of 136

27 of 136

28 of 136

Q1 2025

29 of 136

Sample selection was constrained by:

trio status in Coriell biobank (-Europeans)
low cell line passage count (--Europeans)
genetic diversity (+++Africans)
drift (+Asians, ++Americas)

Draft pangenome composition

30 of 136

Amazing assemblies approach reference quality

Haplotype-resolved assemblies from trio-hifiasm.

They are really good according to realignment of reads to the assemblies and a model of assembly completeness: nearly as good as T2T-CHM13!

Mobin Asri

31 of 136

Then we made 5 pangenome (reference) graphs…

32 of 136

Minigraph

https://doi.org/10.1186/s13059-020-02168-z

33 of 136

Minigraph-Cactus

https://doi.org/10.1038/s41587-023-01793-w

34 of 136

https://doi.org/10.1101/2023.04.05.535718

35 of 136

Key conceptual differences between HPRC pangenome construction methods

minigraph: just SVs, no complex stuff, one reference.

minigraph-cactus: add SNPs, clean up the breakpoints, useful for alignment, one reference.

pggb: everything-vs-everything, hard to align to, useful for studying evolution and pangenome structure at all scales, all genomes are references.

“Collapse” in high-copy repeats →

36 of 136

Minigraph-Cactus creates a hierarchical pangenome rooted in the reference genome, ensuring compatibility with standard tools.

37 of 136

PGGB creates graphs in which each genome can act as a reference, so we choose our reference as needed by later analysis or work in graph space.

38 of 136

Why do we need a human pangenome?

T2T-CHM13 adds ~200 Mb of heterochromatin to the reference.

Draft pangenome adds ~100 Mb of polymorphic euchromatin (and a lot more heterochromatin),

0.6-4.4 Mb of additional genic sequences per haplotype compared to GRCh38 (38 gene CNVs/haplotype).

Pangenome growth (PGGB)
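
One way to read a growth curve: add genomes one at a time and count how much graph sequence has been seen so far. The sketch below assumes each genome is a list of node ids and that node lengths are known; the toy values are made up. Real analyses typically average over many genome orderings, which this single-order version does not do.

```python
def growth_curve(paths, node_length):
    """Cumulative base pairs of distinct nodes as genomes are added in order.

    paths: list of (genome name, list of node ids), in the order to add them.
    node_length: dict mapping node id -> node length in bp.
    """
    seen = set()
    total = 0
    curve = []
    for name, walk in paths:
        for node in walk:
            if node not in seen:
                seen.add(node)
                total += node_length[node]
        curve.append((name, total))
    return curve


# Toy graph: node 3 is private to hapB, node 5 is private to hapC.
node_length = {1: 1000, 2: 50, 3: 300, 4: 800, 5: 20}
paths = [
    ("hapA", [1, 2, 4]),
    ("hapB", [1, 3, 4]),
    ("hapC", [1, 2, 5, 4]),
]
curve = growth_curve(paths, node_length)
```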

39 of 136

PGGB: all chromosomes, layout with path-guided SGD

https://doi.org/10.1101/2023.09.22.558964

Simon Heumos

40 of 136

C4A/B in pggb graph

MHC class II

6p

MHC

centromere

41 of 136

C4A/B in pggb graph

Christian Fischer (UTHSC)

MHC class II

https://github.com/chfi/gfaestus

GenBank annotations on top of the pggb HPRC graph.

42 of 136

C4A/B in pggb graph

Christian Fischer (UTHSC)

C4A/B

MHC class II

https://github.com/chfi/gfaestus

GenBank annotations on top of the pggb HPRC graph.

43 of 136

C4A/B in pggb graph

Christian Fischer (UTHSC)

copy number is related to schizophrenia risk

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4752392/

44 of 136

We learn that genome evolution is often nonlinear

complement component 4 locus

Large SVs predominantly occur at VNTRs, which are simply loops in our pggb graphs.
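
When a repeat unit collapses into a loop, copy number can be read off as the number of times a path passes through the looped node. A toy sketch, assuming paths are lists of node ids (orientation ignored) and using made-up haplotype names:

```python
def node_copy_number(paths, node):
    """Count how many times each path traverses the given node."""
    return {name: walk.count(node) for name, walk in paths.items()}


# Node 2 stands for one repeat unit; hapB walks the self-loop 2 -> 2 twice.
paths = {
    "hapA": [1, 2, 3],
    "hapB": [1, 2, 2, 2, 3],
}
copies = node_copy_number(paths, 2)
```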

45 of 136

We learn that genome evolution is often nonlinear

C4A

C4B

46 of 136

Amylase: how humans evolved a taste for agriculture

47 of 136

48 of 136

Amylase breaks starch into simpler sugars

49 of 136

Amylase is a multi-copy gene family

50 of 136

Human amylase copy number diversity

28 amylase structural haplotypes

11 consensus structures

51 of 136

Deconvolving haplotypes from short reads

52 of 136

Recent evolution from 533 ancient European genomes

Data from Allentoft et al. 2024, Nature

53 of 136

Evidence for selection of high-copy amylase haplotypes

Allele frequency changes over the last ~12,000 years in Eurasian populations.

54 of 136

Evidence for selection of high-copy amylase haplotypes

Allele frequency changes over the last ~12,000 years in Eurasian populations.

55 of 136

Bolognini*, Halgren*, Lou*, Raveane*, Rocha* et al., Nature

56 of 136

Recombination between heterologous acrocentric chromosomes

57 of 136

58 of 136

Workflow

https://github.com/human-pangenomics/HPP_Year1_Assemblies

HPRC assemblies

Mapping against the whole CHM13

PanGenome Graph Builder (PGGB)

https://github.com/pangenome/pggb

Acrocentric contigs covering (+/- 1Mbp) both the p and q arms (pq-contigs)

+ HG002 contigs >= 300 kbp which map to acrocentrics

59 of 136

60 of 136

Untangling the pangenome graph

We look into the graph from the perspective of chromosome 13. Full information from pangenome plus reference annotations.

61 of 136

Untangling the pangenome graph

We look into the graph from the perspective of chromosome 13. Full information from pangenome plus reference annotations.

62 of 136

Untangling the pangenome graph

We look into the graph from the perspective of chromosome 13. Full information from pangenome plus reference annotations.

63 of 136

Untangling the pangenome graph

We look into the graph from the perspective of chromosome 13. Full information from pangenome plus reference annotations.

64 of 136

Pseudo-homologous regions

65 of 136

14

21

FISH probes from GRC clones

Guarracino et al., 2023, Nature

66 of 136

Complete sequence of Robertsonian chromosomes

de Lima*, Guarracino* et al., in preparation

GM03786

GM04890

t(13;14)

t(14;21)

GM03417

67 of 136

Pseudo-homologous regions (PHRs)

Guarracino et al., 2023, Nature

68 of 136

https://doi.org/10.1007/BF02910454

69 of 136

https://doi.org/10.1016/j.ajog.2004.02.062

Why was it not possible to see this before using traditional laboratory techniques?

← 300 FISHing experiments to find one heterologous synapse.

70 of 136

71 of 136

Implicit pangenomes

72 of 136

Heumos et al. https://doi.org/10.1093/bioinformatics/btae363

Pangenome graph of the HPRCy1 including 88 haplotypes.

Represents all variation of all types between all genomes.

73 of 136

Pangenome graphs = Pangenome alignments

Alignments can be represented as sets of matches, which are just pairs of positions in two genomes.

Garrison & Guarracino 2022 https://doi.org/10.1093/bioinformatics/btac743

74 of 136

Pangenome graphs = Pangenome alignments

To build the graph, we condense matched genome letters into single nodes.

Garrison & Guarracino 2022 https://doi.org/10.1093/bioinformatics/btac743
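
A sketch of the idea, assuming matches are exact, forward-strand, and given as (genome A, start A, genome B, start B, length). It uses a union-find over all genome positions so that matched letters end up in the same node. This is an illustration in the spirit of the description above, not the seqwish implementation, and it leaves every node one base long instead of compacting them.

```python
def induce_graph(seqs, matches):
    """Merge matched characters into shared single-base nodes.

    seqs: dict genome name -> sequence string.
    matches: list of (name_a, pos_a, name_b, pos_b, length) exact matches.
    Returns (nodes, paths): node id -> base, genome name -> list of node ids.
    """
    offset, total = {}, 0
    for name, seq in seqs.items():
        offset[name] = total
        total += len(seq)

    parent = list(range(total))  # one union-find element per genome position

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, a_pos, b, b_pos, length in matches:
        for k in range(length):
            ra = find(offset[a] + a_pos + k)
            rb = find(offset[b] + b_pos + k)
            if ra != rb:
                parent[rb] = ra

    node_id, nodes, paths = {}, {}, {}
    for name, seq in seqs.items():
        walk = []
        for i, base in enumerate(seq):
            root = find(offset[name] + i)
            if root not in node_id:
                node_id[root] = len(node_id) + 1
                nodes[node_id[root]] = base
            walk.append(node_id[root])
        paths[name] = walk
    return nodes, paths


# Toy genomes that differ at one position (made-up sequences).
seqs = {"g1": "ACGTA", "g2": "ACCTA"}
matches = [("g1", 0, "g2", 0, 2), ("g1", 3, "g2", 3, 2)]
nodes, paths = induce_graph(seqs, matches)
```

Walking each path and concatenating node labels gives back the input genomes, which is the property that makes the graph a lossless representation of the alignments.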

75 of 136

Pangenome graphs = Pangenome alignments

Variation graph represents all alignments and all genomes.

Garrison & Guarracino 2022 https://doi.org/10.1093/bioinformatics/btac743

76 of 136

Is this the end of genomics?

One model:

all parts of all genomes
all types of variation
genomes of different species, why not!
put it all in there!

GENOME

ALL

Y’ALL

77 of 136

Is this the end of genomics?

One model:

all parts of all genomes
all types of variation
genomes of different species, why not!
put it all in there!

But… there are problems…

GENOME

ALL

Y’ALL

78 of 136

1.05 Terabytes

One chromosome.

79 of 136

You’re not alone, agolicz.

80 of 136

You’re not alone, Wwwwwwwyc.

81 of 136

😭😭😭

Don’t cry, wjwei-handsome

82 of 136

83 of 136

84 of 136

85 of 136

Building pangenome graphs is hard…

86 of 136

Why are graphs hard?

Graphs are “sticky”...

Anything can connect to anything else.

Nonlinear relationships = you never know what you need to know.

That means you need to remember everything, all at once.

87 of 136

==

S. cerevisiae pangenome graph

graph adjacency matrix

(node:node edges)

from Yue et al. 2017 https://doi.org/10.1038/ng.3847

88 of 136

Why are graphs hard?

Graphs are “sticky”...

Anything can connect to anything else.

Nonlinear relationships = you never know what you need to know.

That means you need to remember everything, all at once.

89 of 136

Why are graphs hard?

Graphs are “sticky”...

Anything can connect to anything else.

Nonlinear relationships = you never know what you need to know.

That means you need to remember everything, all at once.

We cheated when we made the HPRCy1 graphs.

Yup.

We broke it up by chromosome.

🙈

🙊

90 of 136

Why does it matter if we cheated?

Guarracino et al. 2023 is why.

Heterologous acrocentric chromosomes recombine.

Recombination occurs here

91 of 136

Let’s overcomplicate things

→

alignments

pangenome graph

… “untangling”

from Guarracino et al. 2023.

92 of 136

Let’s overcomplicate things

→

alignments

pangenome graph

untangling means

… alignments?

93 of 136

The graph vanishes

→

alignments

pangenome graph

… alignments?

X

94 of 136

Implicit data structures

Useful pangenome queries:

liftover (coordinate projection)
subgraph extraction (can be used to divide-and-conquer graph build)
local haplotype matching, homology (untangling)
variant calling
conservation
incomplete lineage sorting

But, we want to avoid building the whole graph up front to get these.

We can obtain these with alignments and an implicit representation of the graph.

95 of 136

Implicit interval tree

https://doi.org/10.1093/bioinformatics/btaa827

100 150

130 200

170 300

180 250

200 250

250 350

270 300

300 320

350 450

390 420

==

IIT
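
A simplified sketch of an interval tree with no pointers: the intervals live in one sorted array, the "tree" is the implicit midpoint recursion over array indices, and each position stores the maximum end in its subtree. This is an assumption-laden teaching variant, not the exact array layout used in the cited paper. Intervals are treated as half-open [start, end).

```python
class ImplicitIntervalTree:
    def __init__(self, intervals):
        self.iv = sorted(intervals)             # sorted by (start, end)
        self.max_end = [0] * len(self.iv)       # max end within each subtree
        self._build(0, len(self.iv))

    def _build(self, lo, hi):
        """Subtree over iv[lo:hi], rooted at the midpoint; returns its max end."""
        if lo >= hi:
            return float("-inf")
        mid = (lo + hi) // 2
        best = max(self.iv[mid][1], self._build(lo, mid), self._build(mid + 1, hi))
        self.max_end[mid] = best
        return best

    def overlap(self, start, end):
        """Return all stored intervals overlapping [start, end), in sorted order."""
        out = []
        self._query(0, len(self.iv), start, end, out)
        return out

    def _query(self, lo, hi, start, end, out):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self.max_end[mid] <= start:
            return  # nothing in this subtree ends after the query start
        self._query(lo, mid, start, end, out)
        s, e = self.iv[mid]
        if s < end and e > start:
            out.append(self.iv[mid])
        if s < end:
            # Starts in the right subtree are >= s, so only descend if s < end.
            self._query(mid + 1, hi, start, end, out)


# The ten intervals listed above.
intervals = [(100, 150), (130, 200), (170, 300), (180, 250), (200, 250),
             (250, 350), (270, 300), (300, 320), (350, 450), (390, 420)]
tree = ImplicitIntervalTree(intervals)
hits = tree.overlap(260, 310)
```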

96 of 136

IMPG: IMplicit Pangenome Graphs

Alignment [start, end) → intervals → implicit interval tree (per genome)

wfmash

https://github.com/pangenome/impg

https://github.com/waveygang/wfmash
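
A sketch of the first step, turning alignments into per-sequence interval lists. It assumes PAF input with the standard twelve leading columns and an optional cg:Z: CIGAR tag, and it keys intervals by target sequence, which is a choice made for this example. The function name and the sample record are hypothetical; this is not the impg code.

```python
from collections import defaultdict


def paf_to_intervals(paf_lines):
    """Group alignment records by target name.

    Returns dict: target name -> list of
    (target_start, target_end, query_name, query_start, query_end, strand, cigar),
    sorted by target start and end.
    """
    by_target = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qstart, qend, strand = fields[0], int(fields[2]), int(fields[3]), fields[4]
        tname, tstart, tend = fields[5], int(fields[7]), int(fields[8])
        cigar = next((f[5:] for f in fields[12:] if f.startswith("cg:Z:")), None)
        by_target[tname].append((tstart, tend, qname, qstart, qend, strand, cigar))
    for records in by_target.values():
        records.sort(key=lambda rec: rec[:2])
    return dict(by_target)


# Made-up records for illustration.
lines = [
    "genomeB\t1000\t300\t400\t+\tgenomeA\t2000\t700\t800\t100\t100\t60\tcg:Z:100=",
    "genomeC\t1500\t100\t200\t+\tgenomeA\t2000\t500\t600\t98\t100\t60\tcg:Z:50=2X48=",
]
index = paf_to_intervals(lines)
```

Each per-target list could then be loaded into an interval tree so that overlap queries avoid scanning every alignment.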

97 of 136

Querying impg

Collect ranges overlapping a target range.

98 of 136

Querying impg

Collect ranges overlapping a target range.

99 of 136

Querying impg

Collect ranges overlapping a target range.

100 of 136

Querying impg

Collect ranges overlapping a target range.

101 of 136

Querying impg

1. Collect ranges overlapping a target range.
2. For each collected range, walk the CIGAR to translate coordinates, and goto (1) …

102 of 136

Querying impg

1. Collect ranges overlapping a target range.
2. For each collected range, walk the CIGAR to translate coordinates, and goto (1) …

103 of 136

Querying impg

1. Collect ranges overlapping a target range.
2. For each collected range, walk the CIGAR to translate coordinates, and goto (1) …

104 of 136

Querying impg

1. Collect ranges overlapping a target range.
2. For each collected range, walk the CIGAR to translate coordinates, and goto (1) …
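
Walking the CIGAR can be sketched as follows. Assumptions: a forward-strand alignment, CIGAR operations limited to M, =, X, I and D, and half-open coordinates. The function name is hypothetical and this is not the impg implementation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MX=ID])")


def project_range(cigar, t_start, q_start, want_start, want_end):
    """Translate target range [want_start, want_end) into query coordinates.

    t_start, q_start: where the alignment begins on target and query.
    Returns (query_start, query_end), or None if the range is not covered.
    """
    t, q = t_start, q_start
    q_lo = q_hi = None
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        t_step = n if op in "MX=D" else 0  # operations that consume target
        q_step = n if op in "MX=I" else 0  # operations that consume query
        lo, hi = max(t, want_start), min(t + t_step, want_end)
        if lo < hi:
            if op == "D":
                a = b = q  # deleted target bases map to a single query point
            else:
                a, b = q + (lo - t), q + (hi - t)
            if q_lo is None:
                q_lo = a
            q_hi = b
        t += t_step
        q += q_step
    return None if q_lo is None else (q_lo, q_hi)


# Toy alignment: starts at target 100 / query 0.
cigar = "10=2I5=3D10="
```

Insertions that fall between two projected pieces are included automatically, because the projected query end keeps extending as the walk proceeds.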

105 of 136

Querying impg

1. Collect ranges overlapping a target range.
2. For each collected range, goto (1) …
3. Deduplicate.

The result is the “transitive closure” of aligned ranges in the pangenome. If graph == alignments, this is a subgraph query.
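
The collect, recurse, deduplicate loop can be sketched as a breadth-first search over ranges. To keep it short, this assumes gapless forward-strand alignments (so projection is a simple offset, with no CIGAR walk) and uses a linear scan instead of an interval tree. The names and the max_depth guard are choices made for this example, not impg parameters.

```python
from collections import deque


def build_index(alignments):
    """alignments: list of (a_name, a_start, a_end, b_name, b_start, b_end).

    Each alignment is indexed from both sides so queries can go either way.
    """
    index = {}
    for a, a_s, a_e, b, b_s, b_e in alignments:
        assert a_e - a_s == b_e - b_s, "sketch assumes gapless alignments"
        index.setdefault(a, []).append((a_s, a_e, b, b_s))
        index.setdefault(b, []).append((b_s, b_e, a, a_s))
    return index


def transitive_query(index, name, start, end, max_depth=10):
    """Return every (name, start, end) range reachable through alignments."""
    seen = {(name, start, end)}
    queue = deque([(name, start, end, 0)])
    while queue:
        name, start, end, depth = queue.popleft()
        if depth == max_depth:
            continue
        for s, e, other, other_start in index.get(name, []):
            lo, hi = max(s, start), min(e, end)  # step 1: overlap
            if lo >= hi:
                continue
            hit = (other, other_start + (lo - s), other_start + (hi - s))
            if hit not in seen:                  # step 3: deduplicate
                seen.add(hit)
                queue.append((*hit, depth + 1))  # step 2: goto (1)
    return sorted(seen)


# Toy pangenome: A aligns to B, and B aligns to C, but A never aligns to C.
alignments = [("A", 0, 100, "B", 50, 150), ("B", 100, 200, "C", 0, 100)]
index = build_index(alignments)
ranges = transitive_query(index, "A", 60, 80)
```

The query on A reaches C through B even though no direct A-to-C alignment exists, which is the point of taking the transitive closure.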

106 of 136

Telomere-to-telomere great ape assemblies

Homo sapiens (human)

Symphalangus syndactylus (siamang gibbon)

Gorilla gorilla (gorilla)

Pongo pygmaeus (Bornean orangutan)

Pongo abelii (Sumatran orangutan)

Pan troglodytes (chimpanzee)

Pan paniscus (bonobo)

phylogeny from PGGB graph on hsa_chr6

107 of 136

Telomere-to-telomere great ape assemblies

Homo sapiens (human)

Symphalangus syndactylus (siamang gibbon)

Gorilla gorilla (gorilla)

Pongo pygmaeus (Bornean orangutan)

Pongo abelii (Sumatran orangutan)

Pan troglodytes (chimpanzee)

Pan paniscus (bonobo)

phylogeny from PGGB graph on hsa_chr6

108 of 136

Telomere-to-telomere great ape assemblies

Homo sapiens (human)

Symphalangus syndactylus (siamang gibbon)

Gorilla gorilla (gorilla)

Pongo pygmaeus (Bornean orangutan)

Pongo abelii (Sumatran orangutan)

Pan troglodytes (chimpanzee)

Pan paniscus (bonobo)

phylogeny from PGGB graph on hsa_chr6

1 graph.

35 days.

job failed

109 of 136

Telomere-to-telomere great ape assemblies

Homo sapiens (human)

Symphalangus syndactylus (siamang gibbon)

Gorilla gorilla (gorilla)

Pongo pygmaeus (Bornean orangutan)

Pongo abelii (Sumatran orangutan)

Pan troglodytes (chimpanzee)

Pan paniscus (bonobo)

phylogeny from PGGB graph on hsa_chr6

1 graph.

35 days.

job failed

New implicit graph approach.

Align all-vs-all

→ impg index (subgraphs)

→ alignment-based analysis (MAF)

110 of 136

MHC queried from impg

111 of 136

Polymorphic inversion on human 8p23.1

inversion

5’ flank

3’ flank

defensin tangle

112 of 136

Conservation scores show “hotspots” of fast evolution

113 of 136

Conservation scores broadly reflect constraint

114 of 136

ILS: Chr 6 (chm13 ref frame)

PanTro3 (P)

PanPan1 (P)

Hg002 (P)

GorGor1 (1)

PonPyg2 (1)

PonAbe1 (1)

O

G

H

CB

Dots = mean quartet voting for 100 kb window

Lines = sliding mean 3 Mb

Ref Length:

172,126,629

QI-sites: 616,403

(around purple)

115 of 136

ILS: Chr 6 (mPonAbe1 ref frame)

PanTro3 (P)

PanPan1 (P)

Hg002 (P)

GorGor1 (1)

PonPyg2 (1)

PonAbe1 (1)

O

G

H

CB

Dots = mean quartet voting for 100 kb window

Lines = sliding mean 3 Mb

Ref Length:

172,605,364

QI-sites: 634,168

(17,765 more)

116 of 136

Andrea Guarracino (pggb, wfmash, seqwish, odgi, chromosome communities)
Simon Heumos (pggb, odgi)
Flavia Villani (pggb, applications to mouse, popgen)
Njagi Mwaniki (wfmash, WFA applications)
Santiago Marco-Sola (WFA, wfmash)
Pjotr Prins (guidance, vcflib, vcfwave)
Richard Durbin (guidance)
Nicole Soranzo (guidance, support)
Benedict Paten (vgteam)
Hao Chen (rat, mouse)
Zhigui Bao (applications)
Lorenzo Tattini (yeast pangenomes)
Enza Colonna (applications to mouse, popgen)
Nadia Pisanti (algorithms)
Luca Pinello (applications)
Jennifer Gerton (robertsonians)
Adam Phillippy (robertsonians)
Peter Sudmant (primate pangenomes)
Robert Williams (guidance)

HPRC pangenomes working group and many others

funders include

Tennessee, NSF, NIH

117 of 136

Practical!

Let’s build some pangenome variation graphs with pggb!

First: a deeper dive into how the method works.

Then: we’ll work through small examples to learn how to drive it.

118 of 136

119 of 136

wfmash (biWFA)

PanGenome Graph Builder

120 of 136

seqwish (unbiased graph builder)

PanGenome Graph Builder

121 of 136

smoothxg (graph normalization)

PanGenome Graph Builder

122 of 136

wfmash makes initial alignments

the wavefront algorithm (WFA)

Needleman-Wunsch

high-order WFA (WF-lambda)

123 of 136

wfmash makes initial alignments

the wavefront algorithm (WFA)

Needleman-Wunsch

high-order bidirectional WFA (BiWFλ)
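
To make the contrast concrete, here is a sketch that computes the same score two ways: a full dynamic-programming table in the Needleman-Wunsch style, and a wavefront scheme that only tracks the furthest-reaching point on each diagonal for each score. It assumes unit edit costs, whereas WFA as published uses gap-affine penalties, so treat this as a simplified illustration and not the wfmash aligner.

```python
def nw_edit_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _extend(a, b, k, i):
    """Slide along diagonal k = i - j for as long as characters match."""
    j = i - k
    while i < len(a) and j < len(b) and a[i] == b[j]:
        i += 1
        j += 1
    return i


def wavefront_edit_distance(a, b):
    """Wavefront scheme: work grows with the score, not with the full table."""
    n, m = len(a), len(b)
    final_k = n - m
    wavefront = {0: _extend(a, b, 0, 0)}  # diagonal -> furthest offset in a
    score = 0
    while wavefront.get(final_k, -1) < n:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            candidates = []
            if k in wavefront:
                candidates.append(wavefront[k] + 1)      # mismatch
            if k - 1 in wavefront:
                candidates.append(wavefront[k - 1] + 1)  # consume a base of a
            if k + 1 in wavefront:
                candidates.append(wavefront[k + 1])      # consume a base of b
            valid = [i for i in candidates if 0 <= i <= n and 0 <= i - k <= m]
            if valid:
                nxt[k] = _extend(a, b, k, max(valid))
        wavefront = nxt
    return score
```

For similar sequences the wavefront version touches only a narrow band of diagonals, which is why this family of methods suits long, closely related genomes.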

124 of 136

seqwish builds the graph

125 of 136

Path-guided stochastic gradient descent optimizes the 1D node order so that it best matches positions in the embedded paths.

Pangenome graph with 12 ALT sequences of the HLA-DRB1 gene from the GRCh38 reference genome.

smoothxg organizes & normalizes the graph

… then we run MSA over this, locally

126 of 136

odgi helps us understand the pangenome

https://www.biorxiv.org/content/10.1101/2021.11.10.467921v1

identity

position

orientation

copy number variation

2d layout

ODGI is meant to be a basic toolkit for interacting with pangenome graphs.

It uses the embedded genomes as references.

127 of 136

https://doi.org/10.1101/2023.04.05.535718

Putting it all together!

128 of 136

129 of 136

Test material today

A few genes from HLA-D (MHC class II) in humans — getting started

https://github.com/pangenome/pggb-workshop/tree/evomics2025

Yeast chromosome 6 — scaling up

~/workshop_materials/pangenomics/cerevisiae.chrV.fa
you will want to index this with samtools faidx first… pggb will warn you

Whole yeast chromosomes — looking at chromosome variation

~/workshop_materials/pangenomics/cerevisiae.pan.fa.gz

130 of 136

Example: yeast chromosome 6

Yue, JX., Li, J., Aigrain, L. et al. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nat Genet 49, 913–924 (2017). https://doi.org/10.1038/ng.3847

pafplot of

initial alignments from wfmash

131 of 136

a bit of the 2D layout

you have to zoom in in the web browser

132 of 136

path view

position view

inversion view

copy number view

133 of 136

134 of 136

TBC1D3

for Evan Eichler

chr17 in PGGB HPRCv1

135 of 136

sequence clipped by minigraph-cactus

136 of 136

annotation of clipped sequences in minigraph-cactus for A. thaliana pangenome