1 of 42

UShER:

pandemic-scale phylogenetics for SARS-CoV-2 and beyond

Angie Hinrichs, UCSC

ABPHM ‘23

2 of 42

Outline

  • A pandemic of genomes
  • UShER and the UShER SARS-CoV-2 tree
  • Applications of a mutation-annotated tree
  • What’s next

3 of 42

How many genomes?

4 of 42

How many genomes?!

5 of 42

How long to build a tree?

6 of 42

Outline

  • A pandemic of genomes
  • UShER and the UShER SARS-CoV-2 tree
  • Applications of a mutation-annotated tree
  • What’s next

7 of 42

UShER: Ultrafast Sample placement on Existing tRees

>sample_1
ATTAAAGGTTTATACCTTCCCAGGT
AACAAACCAACCAACTTTCGATCTC
TTGTAGATCTGTTCTCTAAACGAAC
TTTAAAATCTGTGTGGCTGTCACTC

nextstrain.org, 20 March 2020

8 of 42

UShER: what makes it Ultrafast?

  • Full MSA → compact binary-encoded Mutation-Annotated Tree
  • Maximum likelihood estimation → parsimony
  • Utilize all the CPUs
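
To make the first point concrete, here is a minimal sketch of the idea behind a mutation-annotated tree: each node stores only the mutations on the branch above it, and a sample's alleles are recovered by walking from the root to its leaf. The `Node` class, the `alleles_at` helper, and the toy topology are assumptions made for illustration; this is not UShER's binary format or its API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One node of a toy mutation-annotated tree (hypothetical layout)."""
    name: str
    # Mutations on the branch leading to this node, e.g. "C3037T".
    mutations: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list, repr=False)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def alleles_at(node: Optional[Node]) -> Dict[int, str]:
    """Apply branch mutations from the root down to `node`.

    Returns {position: base} for every position mutated on the path;
    positions absent from the result carry the root's base.
    """
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    alleles: Dict[int, str] = {}
    for n in reversed(path):  # root first, leaf last
        for mut in n.mutations:
            pos, new_base = int(mut[1:-1]), mut[-1]
            alleles[pos] = new_base
    return alleles


# Toy topology (invented); the mutation names are taken from the slide.
root = Node("root")
clade = root.add_child(Node("clade", ["C3037T", "C14408T", "A23403G"]))
sample_1 = clade.add_child(Node("sample_1", ["G3179A", "C6982T"]))
sample_2 = root.add_child(Node("sample_2", ["C8782T"]))

print(alleles_at(sample_1))
```

Storing five short strings for `sample_1` instead of a full aligned genome row is what makes the representation compact in this sketch.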

Turakhia et al. Nature Genetics 2021. https://doi.org/10.1038/s41588-021-00862-7

[Figure: tree with branches labeled by mutations: G3179A, C6982T; C3037T, C14408T, A23403G; T28144C; C8986T; C8782T; C18060T]

Yatish Turakhia, UCSC → UCSD

9 of 42

UShER: Ultrafast Sample placement on Existing tRees

Day 1: MSA + tree → Mutation-Annotated Tree (MAT)

[Figure: the same mutation-labeled tree shown three times, with branches G3179A, C6982T; C3037T, C14408T, A23403G; T28144C; C8986T; C8782T; C18060T. One copy has two additional branches, A17858G and T26512C.]

Day N+1: Day N MAT + MSA of only new sequences → new MAT
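
The Day N+1 step can be illustrated with a simplified parsimony placement: score every existing node by how many mutations a new branch from that node would need, and attach the new sample where that number is smallest. This is a sketch under stated assumptions, not UShER's algorithm: the dict-based tree, the `node_alleles` and `place` helpers, and the toy topology are hypothetical, and the real tool also handles ambiguous bases, placements partway along a branch, and parallel search.

```python
from typing import Dict, List, Optional, Tuple

# Toy tree: node name -> (parent name, mutations on the branch above the node).
Tree = Dict[str, Tuple[Optional[str], List[str]]]


def node_alleles(tree: Tree, name: Optional[str]) -> Dict[int, str]:
    """Non-reference alleles at a node, accumulated from the root."""
    path = []
    while name is not None:
        parent, muts = tree[name]
        path.append(muts)
        name = parent
    ref: Dict[int, str] = {}
    alleles: Dict[int, str] = {}
    for muts in reversed(path):  # root first
        for m in muts:
            old, pos, new = m[0], int(m[1:-1]), m[-1]
            ref.setdefault(pos, old)  # first "old" base seen is the root base
            if new == ref[pos]:
                alleles.pop(pos, None)  # back mutation restores the root base
            else:
                alleles[pos] = new
    return alleles


def place(tree: Tree, sample_muts: List[str]) -> Tuple[str, int]:
    """Return (best node, mutations needed on a new branch from that node)."""
    sample = {int(m[1:-1]): m[-1] for m in sample_muts}
    best: Optional[Tuple[str, int]] = None
    for name in tree:
        node = node_alleles(tree, name)
        # Each position where node and sample disagree costs one mutation.
        cost = sum(1 for pos in set(node) | set(sample)
                   if node.get(pos) != sample.get(pos))
        if best is None or cost < best[1]:
            best = (name, cost)
    assert best is not None, "tree must contain at least the root"
    return best


tree: Tree = {
    "root": (None, []),
    "n1": ("root", ["C3037T", "C14408T", "A23403G"]),
    "n2": ("n1", ["G3179A", "C6982T"]),
    "n3": ("root", ["C8782T", "T28144C"]),
}
new_sample = ["C3037T", "C14408T", "A23403G", "C18060T"]
print(place(tree, new_sample))  # ('n1', 1)
```

Only the new sample's mutations are compared against the stored tree, so nothing from earlier days has to be realigned or rebuilt.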

10 of 42

UShER web interface ~ https://usher.bio/

Upload sequences, or paste in names or IDs

11 of 42

UShER web interface: view subtree in Nextstrain

Get a Nextstrain subtree view of shared sequences most similar to your sequence

12 of 42

UShER web interface: check proposed Pango lineage

Pango lineage proposals often include an UShER subtree Nextstrain view screenshot

13 of 42

Growing the tree

Sources: GISAID, INSDC, COG-UK, CNCB

[Figure: daily update pipeline from yesterday’s tree to today’s tree, with steps labeled deduplicate, diff, align, filter, filter, optimize, annotate]
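
As a rough illustration of the deduplicate and diff steps, the sketch below merges records from several sources and keeps only those not already in yesterday's tree. It assumes the same sample carries the same ID in every source, which real databases do not guarantee; the `select_new` function and the toy records are hypothetical and do not describe the actual pipeline.

```python
from typing import Dict, List, Set


def select_new(sources: List[Dict[str, str]], in_tree: Set[str]) -> Dict[str, str]:
    """Merge {sample_id: sequence} records and drop samples already placed.

    `sources` is ordered by priority: when an ID appears in more than
    one source, the first occurrence wins (an assumption of this sketch).
    """
    merged: Dict[str, str] = {}
    for records in sources:
        for sample_id, seq in records.items():
            merged.setdefault(sample_id, seq)  # deduplicate by ID
    # Diff against yesterday's tree: only unseen samples go on to alignment.
    return {sid: seq for sid, seq in merged.items() if sid not in in_tree}


source_a = {"s1": "ACGT", "s2": "ACGA"}
source_b = {"s2": "ACGN", "s3": "ACTT"}
yesterday = {"s1"}

todays_batch = select_new([source_a, source_b], yesterday)
print(todays_batch)  # {'s2': 'ACGA', 's3': 'ACTT'}
```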

14 of 42

Growing the tree

15 of 42

matOptimize: making an okay tree better

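[Figure: two versions of a tree over nodes A to F. First tree: branch labels 14U, 6G, 4U, 7U, 9C, 2U, 10U, 2U. Second tree: branch labels 14U, 6G, 4U, 7U, 9C, 2U, 10U, and one branch labeled 10U, 14U.]

Cheng Ye, UCSD

Yatish Turakhia, UCSD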

Ye et al. Bioinformatics 2022 https://doi.org/10.1093/bioinformatics/btac401
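
One way to see why rearranging a tree can make it "better" under parsimony is to count the mutations each topology requires. The sketch below uses the standard Fitch algorithm on a single site; it is not matOptimize's implementation, and the leaf names, bases, and nested-tuple tree encoding are assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch parsimony for one site on a rooted binary tree.

    Returns (candidate bases at this node, minimum number of changes
    required in the subtree below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one base: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical site: leaves A and C carry C, leaves B and D carry U.
states = {"A": "C", "B": "U", "C": "C", "D": "U"}
before = (("A", "B"), ("C", "D"))  # groups unlike leaves together
after = (("A", "C"), ("B", "D"))   # same leaves, rearranged

print(fitch(before, states)[1])  # 2
print(fitch(after, states)[1])   # 1
```

The rearranged topology explains the same data with one mutation instead of two, which is the kind of improvement a parsimony optimizer searches for.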

16 of 42

Online Phylogenomics Databases

[Figure: trees over samples A to F illustrating a repeated cycle of addition and optimization]

Russ Corbett-Detig, UCSC

17 of 42

matUtils: extracting the full value from MATs

Jakob McBroome, UCSC

Subcommands:

  • extract: VCF, Newick, JSON
  • summary
  • annotate (clade labels, e.g. A, B)
  • introduce: USA/CA → USA/TX
  • mask
  • uncertainty
  • merge

McBroome et al. Mol Biol Evol. 2021 https://doi.org/10.1093/molbev/msab264

18 of 42

Scaling challenges

usher-sampled!

multiple usher jobs on cluster

19 of 42

Scaling challenges

usher-sampled!

Cheng Ye, UCSD

20 of 42

Making Ultrafast even faster

usher-sampled:

  • Parallelized across samples → reduce redundant traversals
  • Vector instructions

Cheng Ye, UCSD

21 of 42

Outline

  • A pandemic of genomes
  • UShER and the UShER SARS-CoV-2 tree
  • Applications of a mutation-annotated tree
  • What’s next

22 of 42

pangolin’s UShER analysis mode

Full tree of ~14.5 million sequences

pangolin-data v1.19

68k consensus sequences

23 of 42

autolin

  • Information-theoretic identification of candidate lineages
  • Weight samples by relative representation
  • Weight mutations by effect

McBroome et al. bioRxiv 2023 https://doi.org/10.1101/2023.02.03.527052

Pango Autolin

Jakob McBroome, UCSC (graduated)

24 of 42

Taxonium viewer – cov2tree.org

Theo Sanderson, Crick Institute

Sanderson. eLife 2022 https://doi.org/10.7554/eLife.82392

25 of 42

Treenome Browser

Alex Kramer, UCSC

Kramer et al. Bioinformatics 2022 https://doi.org/10.1093/bioinformatics/btac772

26 of 42

ClusterTracker ~ https://clustertracker.gi.ucsc.edu/

McBroome et al. Virus Evolution 2022 https://doi.org/10.1093/ve/veac048

Jakob McBroome, UCSC

27 of 42

RIPPLES

Recombination Inference using Phylogenetic Placements

Yatish Turakhia, UCSD

Turakhia et al. Nature 2022 https://doi.org/10.1038/s41586-022-05189-9

28 of 42

RIVET: Recombination Viewer and Tracker

Kyle Smith, UCSD

Smith et al. bioRxiv 2023 https://doi.org/10.1101/2023.02.17.529036

29 of 42

Outline

  • A pandemic of genomes
  • UShER and the UShER SARS-CoV-2 tree
  • Applications of a mutation-annotated tree
  • What’s next

30 of 42

Beyond SARS-CoV-2

Other pathogens:

  • hMPXV
  • RSV-A, RSV-B
  • In the works: influenza, M. tuberculosis

Workflows, cloudification

Mycobacterium spp.

Lily Karim, UCSC

31 of 42

UShER in the future

  • Support indels, structural variants
  • Address reference bias with pangenome

[Figure: diagrams of Genomes 1 to 4 and a Reference; a second panel labeled Ref, BA.1, BA.2]

32 of 42

Acknowledgements

Team UShER (UC Santa Cruz):

  • Russ Corbett-Detig
  • Jakob McBroome
  • Alex Kramer
  • Bryan Thornlow
  • Nicolas Ayala
  • Adriano de Bernardi Schneider
  • Lily Karim
  • Koorous Vargha
  • Jeltje van Baren
  • Jen Martin
  • Marc Perry
  • David Haussler

Funding:

Team UShER (UC San Diego):

  • Yatish Turakhia
  • Cheng Ye
  • Kyle Smith
  • Sumit Walia
  • Devika Torvi
  • Shoh Mollenkamp

ANU: Rob Lanfear

EMBL-EBI: Nick Goldman, Nicola de Maio

Crick Institute: Theo Sanderson

Genomes:

Pat & Rowland Rebele

33 of 42

Thanks!

Questions? Email me:

angie@soe.ucsc.edu

34 of 42

UShER web interface needs Existing tRee

Through Nov. 2020:

Rob Lanfear’s publicly available, regularly updated sarscov2phylo tree of complete high-quality sequences from GISAID

After Nov. 2020:

No more public updates of sarscov2phylo tree.

→ sarscov2phylo + UShER!

35 of 42

Pre-SARS-CoV-2 phylogenomics

Day 1: MSA → tree

36 of 42

Pre-SARS-CoV-2 phylogenomics

Day 1: MSA → tree

Day N+1: Day N tree + MSA → new tree

37 of 42

Outline

  • A pandemic of genomes
  • UShER and the UShER SARS-CoV-2 tree
  • Applications of a mutation-annotated tree
  • Challenges of sequencing & assembly errors
  • What’s next

38 of 42

Tree scrambler: false reversions to reference

https://virological.org/t/764

39 of 42

Growing pains: sequencing errors, assembly errors

?!

~20,000 seqs

40 of 42

Bad batch

41 of 42

Bad batch

42 of 42

(Pango)-UShER is More Stable

  • PangoLEARN: 78% consistent
  • pUShER: 97% consistent

[Figure: one panel per tool, x-axis: Software Release]

De Bernardi Schneider. In prep.