UShER:
pandemic-scale phylogenetics for SARS-CoV-2 and beyond
Angie Hinrichs, UCSC
ABPHM ‘23
Outline
How many genomes?
How many genomes?!
How long to build a tree?
UShER: Ultrafast Sample placement on Existing tRees
>sample_1
ATTAAAGGTTTATACCTTCCCAGGT
AACAAACCAACCAACTTTCGATCTC
TTGTAGATCTGTTCTCTAAACGAAC
TTTAAAATCTGTGTGGCTGTCACTC
…
nextstrain.org, 20 March 2020
UShER: what makes it Ultrafast?
Turakhia et al. Nature Genetics 2021. https://doi.org/10.1038/s41588-021-00862-7
G3179A, C6982T
C3037T, C14408T, A23403G
T28144C
C8986T
C8782T
C18060T
Yatish Turakhia
UCSC → UCSD
Day 1: MSA + tree → Mutation-Annotated Tree (MAT)
[Figure: mutation-annotated tree. Branch labels: G3179A, C6982T; C3037T, C14408T, A23403G; T28144C; C8986T; C8782T; C18060T. Labels added in a later build of the slide: A17858G; T26512C.]
Day N+1: Day N MAT + MSA of only new sequences → new MAT
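The incremental update above depends on placing each new sample on the existing MAT by maximum parsimony. The following is a minimal sketch of that idea, not UShER's implementation: it assumes a toy tree whose topology is made up (only the mutation labels come from the slides), it assumes the root carries the reference alleles, and it only considers attaching the sample as a new child of an existing node. The names `TREE`, `parse`, `node_alleles` and `place_sample` are hypothetical.

```python
# Toy mutation-annotated tree (MAT): each node stores its parent and the
# mutations on the branch leading to it. The topology is hypothetical.
TREE = {
    "root": {"parent": None, "mutations": []},
    "n1": {"parent": "root", "mutations": ["C8782T"]},
    "n2": {"parent": "n1", "mutations": ["T28144C"]},
    "n3": {"parent": "n2", "mutations": ["C18060T"]},
    "n4": {"parent": "root", "mutations": ["C3037T", "C14408T", "A23403G"]},
    "n5": {"parent": "n4", "mutations": ["G3179A", "C6982T"]},
    "n6": {"parent": "root", "mutations": ["C8986T"]},
}


def parse(mutation):
    """Split a label such as 'C8782T' into (position, old base, new base)."""
    return int(mutation[1:-1]), mutation[0], mutation[-1]


def node_alleles(tree, node):
    """Return {position: base} for sites where the node differs from the root."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node]["parent"]
    alleles, root_allele = {}, {}
    for name in reversed(path):  # walk from the root down to the node
        for mutation in tree[name]["mutations"]:
            pos, old, new = parse(mutation)
            root_allele.setdefault(pos, old)
            if new == root_allele[pos]:
                alleles.pop(pos, None)  # back-mutation to the root allele
            else:
                alleles[pos] = new
    return alleles


def place_sample(tree, sample_mutations):
    """Pick the node where attaching the sample needs the fewest extra mutations."""
    sample = {pos: new for pos, _, new in map(parse, sample_mutations)}

    def cost(node):
        alleles = node_alleles(tree, node)
        positions = set(alleles) | set(sample)
        return sum(alleles.get(p) != sample.get(p) for p in positions)

    best = min(tree, key=cost)
    return best, cost(best)


best_node, extra_mutations = place_sample(TREE, ["C8782T", "T28144C", "A17858G"])
print(best_node, extra_mutations)  # n2 1
```

Because only the new sample's mutations are compared against the stored branch mutations, no full multiple sequence alignment of the existing tree is needed for the update. This sketch recomputes each node's alleles from scratch for clarity and makes no attempt at speed.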
UShER web interface ~ https://usher.bio/
Upload sequences, or paste in names or IDs
UShER web interface: view subtree in Nextstrain
Get a Nextstrain subtree view of shared sequences most similar to your sequence
UShER web interface: check proposed Pango lineage
Pango lineage proposals often include an UShER subtree Nextstrain view screenshot
Growing the tree
[Figure: daily tree-update pipeline. Sources: GISAID, INSDC, COG-UK, CNCB. Input: yesterday's tree. Step labels: deduplicate, diff, align, filter, filter, optimize, annotate. Output: today's tree.]
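The deduplicate and diff steps of the daily update can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: it assumes records from different sources can be matched by a normalized strain name, and the prefixes, sample names and sequences are made up. The helpers `normalize_name`, `deduplicate` and `new_since` are hypothetical.

```python
def normalize_name(name):
    """Hypothetical matching key: drop a source-specific prefix, then uppercase."""
    for prefix in ("hCoV-19/", "SARS-CoV-2/"):
        if name.startswith(prefix):
            name = name[len(prefix):]
    return name.replace(" ", "").upper()


def deduplicate(sources):
    """Merge (source, {name: sequence}) pairs; the first source listed wins."""
    merged = {}
    for source, records in sources:
        for name, sequence in records.items():
            merged.setdefault(normalize_name(name), (source, sequence))
    return merged


def new_since(merged, yesterday_names):
    """Keep only records whose key is absent from yesterday's tree."""
    seen = {normalize_name(name) for name in yesterday_names}
    return {key: value for key, value in merged.items() if key not in seen}


# Made-up records: the same sample appears under two naming schemes.
SOURCES = [
    ("GISAID", {"hCoV-19/USA/CA-0001/2021": "ACGT",
                "hCoV-19/England/X-0002/2021": "ACGA"}),
    ("INSDC", {"SARS-CoV-2/USA/CA-0001/2021": "ACGT",
               "SARS-CoV-2/USA/TX-0003/2021": "ACGG"}),
]
MERGED = deduplicate(SOURCES)
NEW = new_since(MERGED, ["USA/CA-0001/2021"])
print(sorted(NEW))  # ['ENGLAND/X-0002/2021', 'USA/TX-0003/2021']
```

Only the records in `NEW` would then go on to alignment and placement, so the daily cost scales with the number of new sequences rather than the size of the whole tree.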
matOptimize: making an okay tree better
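[Figure: tree over samples A, B, C, D, E, F, shown before and after optimization. Branch labels in the first panel: 14U, 6G, 4U, 7U, 9C, 2U, 10U, 2U. Branch labels in the second panel: 14U, 6G, 4U, 7U, 9C, 2U, 10U, and a branch labeled 10U, 14U.]
Cheng Ye
UCSD
Yatish Turakhia
UCSD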
Ye et al. Bioinformatics 2022 https://doi.org/10.1093/bioinformatics/btac401
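The idea of making a tree better by parsimony can be illustrated by scoring two topologies with Fitch's algorithm and keeping the one that needs fewer mutations. This is a sketch only, not matOptimize's implementation: it rescores the whole tree instead of scoring moves incrementally, and the trees and site states are made up. The names `fitch`, `parsimony_score`, `BEFORE`, `AFTER` and `SITES` are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, mutation count) for one site.

    `tree` is a leaf name or a pair (left, right); `states` maps leaf -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No base in common: one mutation is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sites):
    """Sum the Fitch mutation counts over all sites."""
    return sum(fitch(tree, states)[1] for states in sites)


# One made-up site at which samples B and E carry U and the rest carry C.
SITES = [{"A": "C", "B": "U", "C": "C", "D": "C", "E": "U", "F": "C"}]

BEFORE = (("A", "B"), (("C", "D"), ("E", "F")))  # U arises twice
AFTER = (("A", ("B", "E")), (("C", "D"), "F"))   # E moved next to B

print(parsimony_score(BEFORE, SITES), parsimony_score(AFTER, SITES))  # 2 1
```

In this toy case, moving E next to B lets a single mutation explain both U observations, so the rearranged tree scores lower and would be kept.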
Online Phylogenomics Databases
[Figure: trees over samples A to D and A to F. Labels: Addition, Optimization, Repeat.]
Russ Corbett-Detig
UCSC
matUtils: extracting the full value from MATs
Jakob McBroome
UCSC
extract: VCF, Newick, JSON, …
summary
annotate (A, B)
introduce: USA/CA → USA/TX
mask
uncertainty
merge
McBroome et al. Mol Biol Evol. 2021 https://doi.org/10.1093/molbev/msab264
Scaling challenges
usher-sampled!
multiple usher jobs on cluster
Cheng Ye
UCSD
Making Ultrafast even faster
usher-sampled:
Cheng Ye
UCSD
pangolin’s UShER analysis mode
Full tree of ~14.5 million sequences
pangolin-data v1.19
68k consensus sequences
autolin
McBroome et al. bioRxiv 2023 https://doi.org/10.1101/2023.02.03.527052
Pango Autolin
Jakob McBroome
UCSC (graduated)
Taxonium viewer – cov2tree.org
Theo Sanderson
Crick Institute
Sanderson. eLife 2022 https://doi.org/10.7554/eLife.82392
Treenome Browser
Alex Kramer
UCSC
Kramer et al. Bioinformatics 2022 https://doi.org/10.1093/bioinformatics/btac772
ClusterTracker ~ https://clustertracker.gi.ucsc.edu/
McBroome et al. Virus Evolution 2022 https://doi.org/10.1093/ve/veac048
Jakob McBroome
UCSC
RIPPLES
Recombination Inference using Phylogenetic Placements
Yatish Turakhia
UCSD
Turakhia et al. Nature 2022 https://doi.org/10.1038/s41586-022-05189-9
RIVET: Recombination Viewer and Tracker
Kyle Smith
UCSD
Smith et al. bioRxiv 2023 https://doi.org/10.1101/2023.02.17.529036
Beyond SARS-CoV-2
Other pathogens:
Workflows, cloudification
Mycobacterium spp.
Lily Karim
UCSC
UShER in the future
[Figure: Genome 1, Genome 2, Genome 3 and Genome 4 shown with a Reference. Example labels: Ref, BA.1, BA.2.]
Acknowledgements
Team UShER (UC Santa Cruz):
Funding:
Team UShER (UC San Diego):
ANU: Rob Lanfear
EMBL-EBI: Nick Goldman, Nicola De Maio
Crick Institute: Theo Sanderson
Genomes:
Pat & Rowland Rebele
Thanks!
Questions? Email me:
angie@soe.ucsc.edu
UShER web interface needs Existing tRee
Through Nov. 2020:
Rob Lanfear’s publicly available, regularly updated sarscov2phylo tree of complete high-quality sequences from GISAID
After Nov. 2020:
No more public updates of sarscov2phylo tree.
→ sarscov2phylo + UShER!
Pre-SARS-CoV-2 phylogenomics
Day 1: MSA → tree
Day N+1: Day N tree + MSA → new tree
Tree scrambler: false reversions to reference
https://virological.org/t/764
Growing pains: sequencing errors, assembly errors
?!
~20,000 seqs
Bad batch
(Pango)-UShER is More Stable
[Figure: consistency of lineage assignments by software release. PangoLEARN: 78% consistent. pUShER: 97% consistent.]
De Bernardi Schneider. In prep.